thr3ads.net - CentOS - [CentOS] Result: hostbyte=DID_ERROR driverbyte=DRIVER

lakhera2017

2017-Jan-03 19:59 UTC

[CentOS] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

I am trying to copy ~7 TB of data with rsync between two servers in the same
data center. The backend storage is an EMC VMAX3.

After copying ~30-40 GB of data, multipath starts failing:

Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: Recovered to normal mode
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: remaining active paths: 1
Dec 15 01:57:53 test.example.com kernel: sd 1:0:2:20: [sdeu]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
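
For anyone who wants to pull these fields out of a larger log, here is a minimal Python sketch that parses a kernel "Result:" line like the one above. The function name parse_result_line is hypothetical, and the DID_* table is written from memory of the Linux kernel SCSI headers, so treat the numeric values as an assumption and check them against the headers for your kernel.

```python
import re

# DID_* host byte codes as defined in the Linux kernel SCSI headers.
# Listed from memory (an assumption); verify against your kernel source.
HOST_BYTE_CODES = {
    "DID_OK": 0x00,
    "DID_NO_CONNECT": 0x01,
    "DID_BUS_BUSY": 0x02,
    "DID_TIME_OUT": 0x03,
    "DID_BAD_TARGET": 0x04,
    "DID_ABORT": 0x05,
    "DID_PARITY": 0x06,
    "DID_ERROR": 0x07,
    "DID_RESET": 0x08,
}

# Matches e.g. "sd 1:0:2:20: [sdeu]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK"
RESULT_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): \[(?P<dev>\w+)\]\s+Result:\s+"
    r"hostbyte=(?P<host>\w+)\s+driverbyte=(?P<driver>\w+)"
)


def parse_result_line(line):
    """Return the SCSI address, device and result bytes, or None if no match."""
    match = RESULT_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # None here means the symbolic name is not in the small table above.
    info["host_code"] = HOST_BYTE_CODES.get(info["host"])
    return info
```

The hctl field is the host:channel:target:lun address, which is the same notation multipath -ll prints for each path.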

[root@test log]# multipath -ll | grep -i fail
 |- 1:0:0:15 sdq  65:0   failed ready running
  - 3:0:0:15 sdai 66:32  failed ready running

We are using the default multipath.conf.

HBA driver version: 8.07.00.26.06.8-k
HBA model: QLogic Corp. ISP8324-based 16Gb Fibre Channel to PCI Express Adapter
OS: CentOS 64-bit, kernel 2.6.32-642.6.2.el6.x86_64
Hardware: Intel/HP ProLiant DL380 Gen9

I have already worked through this Red Hat solution and checked with EMC;
everything looks good:
https://access.redhat.com/solutions/438403

Some more info:

- There are no dropped or errored packets on the network side.
- Filesystem is mounted with noatime,nodiratime.
- Filesystem is ext4 (already tried xfs, same error).
- LVM is in striped mode (started with the linear option, then converted to
  striped).
- THP is already disabled:
      echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- Whenever multipath starts failing, the process goes into D state.
- System firmware has been upgraded.
- Tried the latest version of the QLogic driver.
- Tried different I/O schedulers (noop, deadline, cfq).
- Tried a different tuned profile (enterprise-storage).
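
To see which processes are stuck in D state (uninterruptible sleep) when the paths fail, something like the following Python sketch could be run while the problem is happening. It reads the state field from /proc/<pid>/stat. The function names are hypothetical, and it assumes a Linux /proc filesystem.

```python
import os


def parse_stat_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, _, right = stat_text.rpartition(")")
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in D state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                comm, state = parse_stat_state(handle.read())
        except (OSError, IndexError, ValueError):
            continue  # process exited or stat was unreadable mid-scan
        if state == "D":
            found.append((int(entry), comm))
    return found
```

Calling uninterruptible_tasks() in a loop during the rsync run and printing the result would show whether it is only rsync that blocks or also kernel threads such as the journal thread.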

Vmcore collected at the time of the issue

I was able to collect a vmcore at the time of the issue:

  KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.6.2.el6.x86_64/vmlinux
DUMPFILE: vmcore  [PARTIAL DUMP]
    CPUS: 36
    DATE: Fri Dec 16 00:11:26 2016
  UPTIME: 01:48:57
  LOAD AVERAGE: 0.41, 0.49, 0.60
   TASKS: 1238
NODENAME: test.example.com
 RELEASE: 2.6.32-642.6.2.el6.x86_64
 VERSION: #1 SMP Wed Oct 26 06:52:09 UTC 2016
 MACHINE: x86_64  (2297 Mhz)
  MEMORY: 511.9 GB
   PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000018"
     PID: 15840
 COMMAND: "kjournald"
    TASK: ffff884023446ab0  [THREAD_INFO: ffff88103def4000]
     CPU: 2
   STATE: TASK_RUNNING (PANIC)





--
View this message in context:
http://centos.1050465.n5.nabble.com/Result-hostbyte-DID-ERROR-driverbyte-DRIVER-OK-tp5746449.html
Sent from the CentOS mailing list archive at Nabble.com.

Steven Tardy

2017-Jan-05 01:48 UTC

[CentOS] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

> On Jan 3, 2017, at 2:59 PM, lakhera2017 <plakhera at salesforce.com> wrote:
> 
> |- 1:0:0:15 sdq  65:0   failed ready running
>  - 3:0:0:15 sdai 66:32  failed ready running
Does the same SAN target fail each time?
What brand/model/firmware SAN switch is between initiator and target?
Does the HBA show any SCSI aborts?
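
One way to answer the first question from the logs is to count the DID_ERROR results per SCSI host:channel:target. A rough Python sketch, assuming the kernel messages look like the "sd 1:0:2:20: [sdeu] Result: ..." line from the original post and that each message sits on a single line in the log file; count_errors_by_target is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures host, channel, target and lun from lines such as
# "kernel: sd 1:0:2:20: [sdeu]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK"
ERROR_RE = re.compile(
    r"kernel: sd (\d+):(\d+):(\d+):(\d+): \[\w+\]\s+Result:\s+hostbyte=DID_ERROR"
)


def count_errors_by_target(log_lines):
    """Count DID_ERROR results per (host, channel, target) tuple."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            host, channel, target, _lun = (int(g) for g in match.groups())
            counts[(host, channel, target)] += 1
    return counts
```

Feeding it the lines of the system log (for example /var/log/messages, which is an assumption about where the kernel messages end up) and printing counts.most_common() shows whether the errors cluster on one target or are spread across several.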

lakhera2017

2017-Jan-05 18:42 UTC

[CentOS] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

Hi Steven

Please find my answer inline

On Wed, Jan 4, 2017 at 5:48 PM, Steven Tardy wrote:
> > On Jan 3, 2017, at 2:59 PM, lakhera2017 wrote:
> >
> > |- 1:0:0:15 sdq  65:0   failed ready running
> >  - 3:0:0:15 sdai 66:32  failed ready running
>
> Does the same SAN target fail each time?

No, it is a different target every time.

> What brand/model/firmware SAN switch is between initiator and target?

Cisco MDS 9710, NX-OS version 6.2.15
8 Gb SFP end-to-end connectivity

VMAX3
Enginuity Build Version: 5977.813.785

> Does the HBA show any SCSI aborts?

Reply from EMC:

*ENG can see the ab3e/cc3e error logs on a write of 0x180 blocks that spans
tracks from head B to head C.*

*First 0x100 blocks transferred okay.*
*But when we send receiver ready for the remaining 0x80 blocks the host sends
an ABTS, so we need to find out why the host is aborting the write.*
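
The block counts in EMC's reply are hexadecimal, and the remainder only adds up if the last figure is read as 0x80. A small Python check of the arithmetic follows. The 512-byte block size is an assumption (the reply does not state it), so the KiB figures are illustrative only.

```python
BLOCK_SIZE = 512  # bytes per block; assumed, not stated in the EMC reply

total_blocks = 0x180  # size of the failing write
first_burst = 0x100   # blocks that transferred okay
remaining = total_blocks - first_burst  # blocks still pending at the abort


def kib(blocks):
    """Convert a block count to KiB under the assumed block size."""
    return blocks * BLOCK_SIZE // 1024


print(hex(remaining), remaining)  # remaining blocks, hex and decimal
print(kib(total_blocks), kib(first_burst), kib(remaining))
```

Under that assumption the write is 192 KiB in total, with 128 KiB transferred before the abort and 64 KiB outstanding.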


