thr3ads.net - freebsd stable - 6.2-Release ..ish.. CF + ata == freeze? [Feb 2012]

If this information is useful, please help other people find it:
Share via:

john fleming

2012-Feb-14 04:55 UTC

6.2-Release ..ish.. CF + ata == freeze?

Just thought i would post over here as i'm not getting a warm fuzzy from
checkpoint about being able to find the root cause of an issue. I have a large
install base of IPSO checkpoint firewalls, which are based on FreeBSD 6.2.
I've had 3 firewalls hang basically the same way, with something that looks
like a filesystem issue or an?issue with a CF card.
?
Does anyone happen to know of any bugs (i've been looking around) that could
cause something like that? Granted, it could be a batch of bad CF cards, but its
odd that i'm seeing the same thing on 3 different boxes and once rebooted
they seem ok.
?
Also is it possible to get useful info form the atacontroller when things go
south like this from the ddb prompt?
?
This is what shows in show msgbuf
ad0: timeout waiting to issue command
ad0: error issuing WRITE command
ad0: timeout waiting to issue command
ad0: error issuing WRITE command
ad0: timeout waiting to issue command
ad0: error issuing WRITE command
ad0: timeout waiting to issue command
ad0: error issuing WRITE command
g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5 
g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5 
g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
?g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5 
g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5 
?
ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4
atapci0: <Intel 6300ESB UDMA100 controller> port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at
device 31.1 on pci0
ata0: <ATA channel 0> on atapci0
ata1: <ATA channel 1> on atapci0
atapci1: <Intel 6300ESB SATA150 controller> port
0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at
device 31.2 on pci0
ata2: <ATA channel 0> on atapci1
ata3: <ATA channel 1> on atapci1ad0s4h is basically a r/w ufs partition on
the box where almost anything that needs to be written goes.
trace
Tracing pid 1101 tid 100043 td 0x656d8460
kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b
siointr1(64ba1400) at siointr1+0xf0
siointr(64ba1400) at siointr+0x38
intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at
intr_execute_handler+0x61
intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at
intr_execute_handlers+0x40
atpic_handle_intr(4) at atpic_handle_intr+0x96
Xatpic_intr4() at Xatpic_intr4+0x20
--- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c ---
lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f
getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2
flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30
flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf
softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61
ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2
ffs_fsync(f0a4ac74) at ffs_fsync+0x12
VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38
fsync(656d8460,f0a4acb4) at fsync+0x170
syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--More--

Jason Hellenthal

2012-Feb-14 05:13 UTC

head link

6.2-Release ..ish.. CF + ata == freeze?

On Mon, Feb 13, 2012 at 08:43:08PM -0800, john fleming
wrote:> Just thought i would post over here as i'm not getting a warm fuzzy
from checkpoint about being able to find the root cause of an issue. I have a
large install base of IPSO checkpoint firewalls, which are based on FreeBSD 6.2.
I've had 3 firewalls hang basically the same way, with something that looks
like a filesystem issue or an?issue with a CF card.
> ?
> Does anyone happen to know of any bugs (i've been looking around) that
could cause something like that? Granted, it could be a batch of bad CF cards,
but its odd that i'm seeing the same thing on 3 different boxes and once
rebooted they seem ok.
> ?
> Also is it possible to get useful info form the atacontroller when things
go south like this from the ddb prompt?
> ?
> This is what shows in show msgbuf
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
> ?g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5 
> ?
> ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4
> atapci0: <Intel 6300ESB UDMA100 controller> port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at
device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata1: <ATA channel 1> on atapci0
> atapci1: <Intel 6300ESB SATA150 controller> port
0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at
device 31.2 on pci0
> ata2: <ATA channel 0> on atapci1
> ata3: <ATA channel 1> on atapci1ad0s4h is basically a r/w ufs
partition on the box where almost anything that needs to be written goes.
> trace
> Tracing pid 1101 tid 100043 td 0x656d8460
> kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b
> siointr1(64ba1400) at siointr1+0xf0
> siointr(64ba1400) at siointr+0x38
> intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at
intr_execute_handler+0x61
> intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at
intr_execute_handlers+0x40
> atpic_handle_intr(4) at atpic_handle_intr+0x96
> Xatpic_intr4() at Xatpic_intr4+0x20
> --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c ---
> lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f
> getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2
> flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30
> flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf
> softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61
> ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2
> ffs_fsync(f0a4ac74) at ffs_fsync+0x12
> VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38
> fsync(656d8460,f0a4acb4) at fsync+0x170
> syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee
> Xint0x80_syscall() at Xint0x80_syscall+0x1f
This looks to be a problem with softupdates and CF cards. Can you get
this to repeat on a brand new (good) card ?

-- 
;s =;

Jeremy Chadwick

2012-Feb-14 05:18 UTC

head link

6.2-Release ..ish.. CF + ata == freeze?

On Mon, Feb 13, 2012 at 08:43:08PM -0800, john fleming
wrote:> Just thought i would post over here as i'm not getting a warm fuzzy
from checkpoint about being able to find the root cause of an issue. I have a
large install base of IPSO checkpoint firewalls, which are based on FreeBSD 6.2.
I've had 3 firewalls hang basically the same way, with something that looks
like a filesystem issue or an?issue with a CF card.
FreeBSD 6.2 was EOL'd in early-to-mid-2008.  The ATA driver has changed
significantly since then (present-day uses CAM).
> Does anyone happen to know of any bugs (i've been looking around) that
could cause something like that? Granted, it could be a batch of bad CF cards,
but its odd that i'm seeing the same thing on 3 different boxes and once
rebooted they seem ok.
> ?
> Also is it possible to get useful info form the atacontroller when things
go south like this from the ddb prompt?
Not particularly.  What's shown below indicates that the driver had
issued some form of ATA write command (there are multiple kinds per ATA
specification), and either the underlying media (CF/disk) or controller
stalled/locked up/took too long.  I forget what the timeout value is in
6.2; I can't be bothered to remember such from 6 years ago.  :-)
> This is what shows in show msgbuf
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
> ?g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5 
error 5 = EIO = Input/output error.  But this isn't too big of a
surprise given the timeouts you see prior.

Are these CF cards brand new -- meaning, are they completely unused
(having never had any writes done to them), or have they been in use a
while?  I'm betting they've been in use a while, and have probably been
doing many writes over the years.

Two things to note here:

1) The errors you've shown are only happening on writes, not reads.  Of
course if you omitted information then this isn't an accurate statement.
2) Timeouts are seen when issuing writes to some LBA regions.

How full is the CF card, disk-space-wise?  Not just ad0s4h, I'm talking
about the entire card.  How much space is roughly available?  They're
very small CF cards (1.8GByte roughly), and the less space available,
the less effectiveness of wear levelling (and in some cases the slower
the writes are).

Reason I ask: given that these are CF cards, this smells of cards which
are simply "worn down".  CF cards have limited numbers of writes, and
the card may be "freaking out" internally when attempting to write to
some LBAs which map to CF sectors that are, in effect, "bad".  The CF
cards' ECC implementation may be buggy, or may simply be "spinning
hard"
for too long.  You can read about this sort of behaviour on Wikipedia's
CompactFlash article.

You wouldn't be able to verify this with dd if=/dev/ad0, because those
are read operations.  You could zero the media (dd if=/dev/zero
of=/dev/ad0) as a form of verification if you wanted.

Do you happen to know if these CF cards support SMART?  If so,
installing smartmontools (version 5.42 or newer please) and providing
output from "smartctl -a /dev/ad0" may be helpful to me, but I make no
guarantees anything of use will be shown there.

Overall my advice would be to replace the CF cards, especially if they
have been in use for a long while.  It really doesn't matter to me that
it's happening on 3 machines (honest), especially if these are 6.2
machines with CF cards that have been in use for years.  We're lucky to
get 2 years out of our CF cards on our Juniper M120/320s before they
start spitting I/O errors.  Pick larger CF cards as well; more space more room
for effective wear levelling.
> ?
> ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4
> atapci0: <Intel 6300ESB UDMA100 controller> port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at
device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata1: <ATA channel 1> on atapci0
> atapci1: <Intel 6300ESB SATA150 controller> port
0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at
device 31.2 on pci0
> ata2: <ATA channel 0> on atapci1
> ata3: <ATA channel 1> on atapci1ad0s4h is basically a r/w ufs
partition on the box where almost anything that needs to be written goes.
> trace
> Tracing pid 1101 tid 100043 td 0x656d8460
> kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b
> siointr1(64ba1400) at siointr1+0xf0
> siointr(64ba1400) at siointr+0x38
> intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at
intr_execute_handler+0x61
> intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at
intr_execute_handlers+0x40
> atpic_handle_intr(4) at atpic_handle_intr+0x96
> Xatpic_intr4() at Xatpic_intr4+0x20
> --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c ---
> lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f
> getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2
> flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30
> flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf
> softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61
> ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2
> ffs_fsync(f0a4ac74) at ffs_fsync+0x12
> VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38
> fsync(656d8460,f0a4acb4) at fsync+0x170
> syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee
> Xint0x80_syscall() at Xint0x80_syscall+0x1f
-- 
| Jeremy Chadwick                                 jdc@parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

Adam Vande More

2012-Feb-14 05:38 UTC

head link

6.2-Release ..ish.. CF + ata == freeze?

On Mon, Feb 13, 2012 at 10:43 PM, john fleming
<jflemingeds@yahoo.com>wrote:
> Just thought i would post over here as i'm not getting a warm fuzzy
from
> checkpoint about being able to find the root cause of an issue. I have a
> large install base of IPSO checkpoint firewalls, which are based on FreeBSD
> 6.2. I've had 3 firewalls hang basically the same way, with something
that
> looks like a filesystem issue or an issue with a CF card.
>
There was a thread just the other day mentioned lockup problems with DMA
and CF cards. Disabling DMA or reducing the mode helped.  Not sure if
applicable to that old of FreeBSD version.

-- 
Adam Vande More

freebsd stable - Feb 2012 - 6.2-Release ..ish.. CF + ata == freeze?

6.2-Release ..ish.. CF + ata == freeze?

6.2-Release ..ish.. CF + ata == freeze?

6.2-Release ..ish.. CF + ata == freeze?

6.2-Release ..ish.. CF + ata == freeze?