thr3ads.net - freebsd stable - siis_timeout with port multiplier on 9.0R [May 2012]

Matthew Gamble

2012-May-22 01:05 UTC

siis_timeout with port multiplier on 9.0R

We have a box with 3 SiI3124 SATA controllers and 9 CFI-B53PM 5 Port Backplane
port multipliers (the "backblaze storage pod").  Under intense IO (ZFS
rebuild, presently) the system will lock up all IO for 3-4 minutes and the
following entry appears in the dmesg:

siisch11: Timeout on slot 30
siisch11: siis_timeout is 00040000 ss 65000000 rs 65000000 es 00000000 sts 80192000 serr 00000000
siisch11:  ... waiting for slots 25000000
siisch11: Timeout on slot 26
siisch11: siis_timeout is 00040000 ss 65000000 rs 65000000 es 00000000 sts 80192000 serr 00000000
siisch11:  ... waiting for slots 21000000
siisch11: Timeout on slot 29
siisch11: siis_timeout is 00040000 ss 65000000 rs 65000000 es 00000000 sts 80192000 serr 00000000
siisch11:  ... waiting for slots 01000000
siisch11: Timeout on slot 24
siisch11: siis_timeout is 00040000 ss 65000000 rs 65000000 es 00000000 sts 80192000 serr 00000000
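
Editor's note: the hex values in the log above read naturally as 32-bit masks with one bit per command slot. That reading is an assumption inferred from the log itself (each "waiting for slots" value drops exactly the slot that just timed out), not something stated in the thread. A minimal Python sketch of the decoding, with `decode_slots` as a hypothetical helper name:

```python
def decode_slots(mask_hex):
    """Return the bit positions set in a 32-bit hex mask, lowest first."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if mask & (1 << bit)]


# Values copied from the siisch11 log lines above.
print(decode_slots("65000000"))  # [24, 26, 29, 30]
print(decode_slots("25000000"))  # [24, 26, 29]  (slot 30 has timed out)
print(decode_slots("21000000"))  # [24, 29]      (slot 26 has timed out)
print(decode_slots("01000000"))  # [24]          (slot 29 has timed out)
```

Under that assumption, four queued commands (slots 24, 26, 29 and 30) were outstanding on siisch11 and all four timed out one after another.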

The errors are on different siisch devices, so it's not likely to be a SATA cable
issue unless multiple cables all went bad at the same time.  On the advice of
some other posts to the mailing list I've already tried locking the SATA rev
to 1 with the following in /boot/loader.conf, which didn't help:

hint.siisch.0.sata_rev=1
hint.siisch.1.sata_rev=1
hint.siisch.2.sata_rev=1
hint.siisch.3.sata_rev=1
hint.siisch.4.sata_rev=1
hint.siisch.5.sata_rev=1
hint.siisch.6.sata_rev=1
hint.siisch.7.sata_rev=1
hint.siisch.8.sata_rev=1
hint.siisch.9.sata_rev=1
hint.siisch.10.sata_rev=1
hint.siisch.11.sata_rev=1
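
Editor's note: the twelve hint lines above follow one pattern, one per channel (3 controllers with 4 ports each, numbered 0 to 11). A small Python sketch that generates them, so the list can be regenerated for a different channel count; `sata_rev_hints` is a hypothetical helper name, and the output still has to be pasted into /boot/loader.conf by hand:

```python
def sata_rev_hints(channels, rev=1):
    """Build one loader.conf hint line per siisch channel, numbered from 0."""
    return [f"hint.siisch.{n}.sata_rev={rev}" for n in range(channels)]


# 3 controllers x 4 ports = 12 channels, as in the setup described above.
print("\n".join(sata_rev_hints(12)))
```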

From time to time this is also causing one of the attached drives to go offline:

siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f2000 serr 00000000
(ada0:siisch0:0:0:0): lost device
(ada0:siisch0:0:0:0): removing device entry
ada0 at siisch0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD30EZRX-00MMMB0 80.00A80> ATA-8 SATA 3.x device
ada0: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
siisch11: Timeout on slot 30
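
Editor's note: whether the timeouts really are spread across channels, rather than clustered on one cable or port multiplier, can be checked by counting the siis_timeout register-dump lines per channel. A rough Python sketch, assuming the dmesg text has been saved to a string; the helper name `timeouts_per_channel` is hypothetical, and the sample lines are taken from this message:

```python
import re
from collections import Counter

# Matches the start of a register-dump line such as
# "siisch11: siis_timeout is 00040000 ss ..."
DUMP_RE = re.compile(r"^(siisch\d+): siis_timeout is", re.MULTILINE)


def timeouts_per_channel(dmesg_text):
    """Count siis_timeout register dumps per siisch channel."""
    return Counter(m.group(1) for m in DUMP_RE.finditer(dmesg_text))


sample = """\
siisch11: Timeout on slot 30
siisch11: siis_timeout is 00040000 ss 65000000 rs 65000000 es 00000000 sts 80192000 serr 00000000
siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f2000 serr 00000000
"""

for channel, count in sorted(timeouts_per_channel(sample).items()):
    print(channel, count)
```

A tally dominated by one channel would point back toward a single cable or multiplier; an even spread would be more consistent with a driver or controller-level problem.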

When the drive goes offline, the ZFS rebuild restarts, so the rebuild of the
array never finishes.  Does anyone have any insight into what could be causing
the timeouts and what we can do to resolve them?  Right now my priority is to
get the system a bit more stable so the current ZFS rebuild can complete; it
has been running the same rebuild for just over 6 days, and the timeouts and
drive drop-offs keep restarting it.






Mike Tancsa

2012-May-23 14:23 UTC


siis_timeout with port multiplier on 9.0R

On 5/21/2012 9:04 PM, Matthew Gamble wrote:
> We have a box with 3 SiI3124 SATA controllers and 9 CFI-B53PM 5 Port
> Backplane port multipliers (the "backblaze storage pod").  Under
> intense IO (ZFS rebuild, presently) the system will lock up all IO for 3-4
> minutes and the following entry appears in the dmesg:
> 
> siisch11: Timeout on slot 30
> siisch11: siis_timeout is 00040000 ss 65000000 rs 65000000 es 00000000 sts 80192000 serr 00000000
> siisch11:  ... waiting for slots 25000000
> siisch11: Timeout on slot 26
> siisch11: siis_timeout is 00040000 ss 65000000 rs 65000000 es 00000000 sts 80192000 serr 00000000
> siisch11:  ... waiting for slots 21000000
> siisch11: Timeout on slot 29
> siisch11: siis_timeout is 00040000 ss 65000000 rs 65000000 es 00000000 sts 80192000 serr 00000000
> siisch11:  ... waiting for slots 01000000
> siisch11: Timeout on slot 24
> siisch11: siis_timeout is 00040000 ss 65000000 rs 65000000 es 00000000 sts 80192000 serr 00000000
> 
> The errors are on different siisch devices so its not likely to be a SATA
> cable issue unless multiple cables all went bad at the same time.  On the advice
> of some other posts to the mailing list I've already tried locking the SATA
> rev to one with the following in /boot/loader.conf which didn't

If they are on different siisch devices then yes, it does not sound like
a bad cable. However, I have had errors similar to the ones above
that were fixed by using new cables.  If you are using 9.0R, I would
suggest upgrading to stable. There have been a few bug fixes /
improvements to the drivers as well as various parts of the disk
subsystem. I have RELENG_8 right now and it's quite stable for me on a
25TB system, which is for the most part similar to 9.x

# zpool status
  pool: zbackup1
 state: ONLINE
  scan: scrub repaired 0 in 11h11m with 0 errors on Mon Jul 25 19:51:11 2011
config:

        NAME        STATE     READ WRITE CKSUM
        zbackup1    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada14   ONLINE       0     0     0
            ada16   ONLINE       0     0     0
            ada13   ONLINE       0     0     0
            ada15   ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1-2  ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1-3  ONLINE       0     0     0
            ada9    ONLINE       0     0     0
            ada10   ONLINE       0     0     0
            ada11   ONLINE       0     0     0
            ada12   ONLINE       0     0     0

errors: No known data errors
# zpool get all zbackup1
NAME      PROPERTY       VALUE       SOURCE
zbackup1  size           25.4T       -
zbackup1  capacity       68%         -
zbackup1  altroot        -           default
zbackup1  health         ONLINE      -
zbackup1  guid           917659042733882722  default
zbackup1  version        28          default
zbackup1  bootfs         -           default
zbackup1  delegation     on          default
zbackup1  autoreplace    off         default
zbackup1  cachefile      -           default
zbackup1  failmode       wait        default
zbackup1  listsnapshots  on          local
zbackup1  autoexpand     off         default
zbackup1  dedupditto     0           default
zbackup1  dedupratio     1.00x       -
zbackup1  free           7.95T       -
zbackup1  allocated      17.4T       -
zbackup1  readonly       off         -
zbackup1  comment        -           default

This is on an Addonics adapter.

	---Mike

> hint.siisch.0.sata_rev=1
> hint.siisch.1.sata_rev=1
> hint.siisch.2.sata_rev=1
> hint.siisch.3.sata_rev=1
> hint.siisch.4.sata_rev=1
> hint.siisch.5.sata_rev=1
> hint.siisch.6.sata_rev=1
> hint.siisch.7.sata_rev=1
> hint.siisch.8.sata_rev=1
> hint.siisch.9.sata_rev=1
> hint.siisch.10.sata_rev=1
> hint.siisch.11.sata_rev=1
> 
> From time to time this is also causing one of the attached drives to go
> offline:
> 
> siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f2000 serr 00000000
> (ada0:siisch0:0:0:0): lost device
> (ada0:siisch0:0:0:0): removing device entry
> ada0 at siisch0 bus 0 scbus0 target 0 lun 0
> ada0: <WDC WD30EZRX-00MMMB0 80.00A80> ATA-8 SATA 3.x device
> ada0: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
> ada0: Command Queueing enabled
> ada0: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
> ada0: Previously was known as ad4
> siisch11: Timeout on slot 30
> 
> When the drive goes offline that causes the ZFS rebuild to restart, and so
> it's never finishing the rebuild of the array.  Does anyone have any insight
> into what could be causing the timeouts and what we can do to resolve them?
> Right now my priority is to get the system a bit more stable so the current ZFS
> rebuild can complete; right now it's been doing the same rebuild for just
> over 6 days and the timeouts and drive drop offs are causing it to restart
> constantly.

-- 
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/
