Just thought i would post over here as i'm not getting a warm fuzzy from checkpoint about being able to find the root cause of an issue. I have a large install base of IPSO checkpoint firewalls, which are based on FreeBSD 6.2. I've had 3 firewalls hang basically the same way, with something that looks like a filesystem issue or an?issue with a CF card. ? Does anyone happen to know of any bugs (i've been looking around) that could cause something like that? Granted, it could be a batch of bad CF cards, but its odd that i'm seeing the same thing on 3 different boxes and once rebooted they seem ok. ? Also is it possible to get useful info form the atacontroller when things go south like this from the ddb prompt? ? This is what shows in show msgbuf ad0: timeout waiting to issue command ad0: error issuing WRITE command ad0: timeout waiting to issue command ad0: error issuing WRITE command ad0: timeout waiting to issue command ad0: error issuing WRITE command ad0: timeout waiting to issue command ad0: error issuing WRITE command g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5 g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5 g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5 ?g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5 g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5 ? ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4 atapci0: <Intel 6300ESB UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at device 31.1 on pci0 ata0: <ATA channel 0> on atapci0 ata1: <ATA channel 1> on atapci0 atapci1: <Intel 6300ESB SATA150 controller> port 0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at device 31.2 on pci0 ata2: <ATA channel 0> on atapci1 ata3: <ATA channel 1> on atapci1ad0s4h is basically a r/w ufs partition on the box where almost anything that needs to be written goes. trace Tracing pid 1101 tid 100043 td 0x656d8460 kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b siointr1(64ba1400) at siointr1+0xf0 siointr(64ba1400) at siointr+0x38 intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at intr_execute_handler+0x61 intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at intr_execute_handlers+0x40 atpic_handle_intr(4) at atpic_handle_intr+0x96 Xatpic_intr4() at Xatpic_intr4+0x20 --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c --- lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2 flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30 flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61 ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2 ffs_fsync(f0a4ac74) at ffs_fsync+0x12 VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38 fsync(656d8460,f0a4acb4) at fsync+0x170 syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee Xint0x80_syscall() at Xint0x80_syscall+0x1f --More--
On Mon, Feb 13, 2012 at 08:43:08PM -0800, john fleming wrote:> Just thought i would post over here as i'm not getting a warm fuzzy from checkpoint about being able to find the root cause of an issue. I have a large install base of IPSO checkpoint firewalls, which are based on FreeBSD 6.2. I've had 3 firewalls hang basically the same way, with something that looks like a filesystem issue or an?issue with a CF card. > ? > Does anyone happen to know of any bugs (i've been looking around) that could cause something like that? Granted, it could be a batch of bad CF cards, but its odd that i'm seeing the same thing on 3 different boxes and once rebooted they seem ok. > ? > Also is it possible to get useful info form the atacontroller when things go south like this from the ddb prompt? > ? > This is what shows in show msgbuf > ad0: timeout waiting to issue command > ad0: error issuing WRITE command > ad0: timeout waiting to issue command > ad0: error issuing WRITE command > ad0: timeout waiting to issue command > ad0: error issuing WRITE command > ad0: timeout waiting to issue command > ad0: error issuing WRITE command > g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5 > g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5 > g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5 > ?g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5 > g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5 > ? > ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4 > atapci0: <Intel 6300ESB UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at device 31.1 on pci0 > ata0: <ATA channel 0> on atapci0 > ata1: <ATA channel 1> on atapci0 > atapci1: <Intel 6300ESB SATA150 controller> port 0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at device 31.2 on pci0 > ata2: <ATA channel 0> on atapci1 > ata3: <ATA channel 1> on atapci1ad0s4h is basically a r/w ufs partition on the box where almost anything that needs to be written goes. > trace > Tracing pid 1101 tid 100043 td 0x656d8460 > kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b > siointr1(64ba1400) at siointr1+0xf0 > siointr(64ba1400) at siointr+0x38 > intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at intr_execute_handler+0x61 > intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at intr_execute_handlers+0x40 > atpic_handle_intr(4) at atpic_handle_intr+0x96 > Xatpic_intr4() at Xatpic_intr4+0x20 > --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c --- > lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f > getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2 > flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30 > flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf > softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61 > ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2 > ffs_fsync(f0a4ac74) at ffs_fsync+0x12 > VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38 > fsync(656d8460,f0a4acb4) at fsync+0x170 > syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee > Xint0x80_syscall() at Xint0x80_syscall+0x1fThis looks to be a problem with softupdates and CF cards. Can you get this to repeat on a brand new (good) card ? -- ;s =;
On Mon, Feb 13, 2012 at 08:43:08PM -0800, john fleming wrote:> Just thought i would post over here as i'm not getting a warm fuzzy from checkpoint about being able to find the root cause of an issue. I have a large install base of IPSO checkpoint firewalls, which are based on FreeBSD 6.2. I've had 3 firewalls hang basically the same way, with something that looks like a filesystem issue or an?issue with a CF card.FreeBSD 6.2 was EOL'd in early-to-mid-2008. The ATA driver has changed significantly since then (present-day uses CAM).> Does anyone happen to know of any bugs (i've been looking around) that could cause something like that? Granted, it could be a batch of bad CF cards, but its odd that i'm seeing the same thing on 3 different boxes and once rebooted they seem ok. > ? > Also is it possible to get useful info form the atacontroller when things go south like this from the ddb prompt?Not particularly. What's shown below indicates that the driver had issued some form of ATA write command (there are multiple kinds per ATA specification), and either the underlying media (CF/disk) or controller stalled/locked up/took too long. I forget what the timeout value is in 6.2; I can't be bothered to remember such from 6 years ago. :-)> This is what shows in show msgbuf > ad0: timeout waiting to issue command > ad0: error issuing WRITE command > ad0: timeout waiting to issue command > ad0: error issuing WRITE command > ad0: timeout waiting to issue command > ad0: error issuing WRITE command > ad0: timeout waiting to issue command > ad0: error issuing WRITE command > g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5 > g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5 > g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5 > ?g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5 > g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5error 5 = EIO = Input/output error. But this isn't too big of a surprise given the timeouts you see prior. Are these CF cards brand new -- meaning, are they completely unused (having never had any writes done to them), or have they been in use a while? I'm betting they've been in use a while, and have probably been doing many writes over the years. Two things to note here: 1) The errors you've shown are only happening on writes, not reads. Of course if you omitted information then this isn't an accurate statement. 2) Timeouts are seen when issuing writes to some LBA regions. How full is the CF card, disk-space-wise? Not just ad0s4h, I'm talking about the entire card. How much space is roughly available? They're very small CF cards (1.8GByte roughly), and the less space available, the less effectiveness of wear levelling (and in some cases the slower the writes are). Reason I ask: given that these are CF cards, this smells of cards which are simply "worn down". CF cards have limited numbers of writes, and the card may be "freaking out" internally when attempting to write to some LBAs which map to CF sectors that are, in effect, "bad". The CF cards' ECC implementation may be buggy, or may simply be "spinning hard" for too long. You can read about this sort of behaviour on Wikipedia's CompactFlash article. You wouldn't be able to verify this with dd if=/dev/ad0, because those are read operations. You could zero the media (dd if=/dev/zero of=/dev/ad0) as a form of verification if you wanted. Do you happen to know if these CF cards support SMART? If so, installing smartmontools (version 5.42 or newer please) and providing output from "smartctl -a /dev/ad0" may be helpful to me, but I make no guarantees anything of use will be shown there. Overall my advice would be to replace the CF cards, especially if they have been in use for a long while. It really doesn't matter to me that it's happening on 3 machines (honest), especially if these are 6.2 machines with CF cards that have been in use for years. We're lucky to get 2 years out of our CF cards on our Juniper M120/320s before they start spitting I/O errors. Pick larger CF cards as well; more space more room for effective wear levelling.> ? > ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4 > atapci0: <Intel 6300ESB UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at device 31.1 on pci0 > ata0: <ATA channel 0> on atapci0 > ata1: <ATA channel 1> on atapci0 > atapci1: <Intel 6300ESB SATA150 controller> port 0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at device 31.2 on pci0 > ata2: <ATA channel 0> on atapci1 > ata3: <ATA channel 1> on atapci1ad0s4h is basically a r/w ufs partition on the box where almost anything that needs to be written goes. > trace > Tracing pid 1101 tid 100043 td 0x656d8460 > kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b > siointr1(64ba1400) at siointr1+0xf0 > siointr(64ba1400) at siointr+0x38 > intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at intr_execute_handler+0x61 > intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at intr_execute_handlers+0x40 > atpic_handle_intr(4) at atpic_handle_intr+0x96 > Xatpic_intr4() at Xatpic_intr4+0x20 > --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c --- > lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f > getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2 > flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30 > flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf > softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61 > ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2 > ffs_fsync(f0a4ac74) at ffs_fsync+0x12 > VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38 > fsync(656d8460,f0a4acb4) at fsync+0x170 > syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee > Xint0x80_syscall() at Xint0x80_syscall+0x1f-- | Jeremy Chadwick jdc@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB |
On Mon, Feb 13, 2012 at 10:43 PM, john fleming <jflemingeds@yahoo.com>wrote:> Just thought i would post over here as i'm not getting a warm fuzzy from > checkpoint about being able to find the root cause of an issue. I have a > large install base of IPSO checkpoint firewalls, which are based on FreeBSD > 6.2. I've had 3 firewalls hang basically the same way, with something that > looks like a filesystem issue or an issue with a CF card. >There was a thread just the other day mentioned lockup problems with DMA and CF cards. Disabling DMA or reducing the mode helped. Not sure if applicable to that old of FreeBSD version. -- Adam Vande More