El lun, 23-10-2006 a las 16:28 +0400, Kirill Korotaev escribi?:> J.J.Garcia, > > in july 2006 you reported hit of bio.c:99 bug on CentOs kernel: > http://lists.centos.org/pipermail/centos/2006-July/067539.html > > Recently, we also hit the same bug, so maybe you can provide some details on: > 1. whether it is doesn't work for you or was fixed. > 2. how it was triggered > 3. whether you know the fast way to reproduce > > Any help would be appreciated. > > Thanks, > KirillHi Kirill, Since then no more news from this subject, sorry about it. You can follow the full story of my testing at CentOS Bug Tracker with 1419 id, http://bugs.centos.org/view.php?id=1419 Actually i'm running the stock kernel 42.0.3 in affected host with a lil more stability appreciated than previous kernels released but not really stable in certain situations (heavy i/o load on external usb 2.0 disk). [root at fattybox ~]# cat /etc/grub.conf | grep vmlinuz-`uname -r` kernel /vmlinuz-2.6.9-42.0.3.EL ro root=/dev/VolGroup00/LogVol00 ACPI=off vga=0x307 selinux=0 [root at fattybox ~]# dmesg | grep ACPI ACPI: Unable to locate RSDP Kernel command line: ro root=/dev/VolGroup00/LogVol00 ACPI=off vga=0x307 selinux=0 ACPI: Subsystem revision 20040816 ACPI: Interpreter disabled. What's sure, the system was running rock solid with 4.3 release and 22.0.2 kernel. From that point, any update to kernel or pkgs to 4.4 (34.x.x to 42.x.x) i get the same issue. The call trace (taken from 1419 id. at bug tracker and re-pasted here) is almost allways the same, and it happends normally when lot of i/o work is performed on an USB external disk, locally or via NFS/exported. Call Trace: [<c016deb4>] bio_put+0x27/0x28 [<c016d698>] end_bio_bh_io_sync+0x33/0x37 [<c016ea55>] bio_endio+0x4f/0x54 [<c0251ce5>] __end_that_request_first+0xea/0x1ab [<d841cdd1>] scsi_end_request+0x1b/0x174 [scsi_mod] [<d841d268>] scsi_io_completion+0x20b/0x417 [scsi_mod] [<d841800c>] scsi_finish_command+0xad/0xb1 [scsi_mod] [<d8417f31>] scsi_softirq+0xba/0xc2 [scsi_mod] [<c0126a9d>] __do_softirq+0x35/0x79 [<c010934c>] do_softirq+0x46/0x4d Don't know for you, but im using on that host an PCI/USB 2.0 card, im not having issues/tested yet same situation for hosts with native USB 2.0. [root at fattybox ~]# lspci 00:00.0 Host bridge: VIA Technologies, Inc. VT8601 [Apollo ProMedia] (rev 05) 00:01.0 PCI bridge: VIA Technologies, Inc. VT8601 [Apollo ProMedia AGP] 00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 40) 00:07.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06) 00:07.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40) 00:07.5 Multimedia audio controller: VIA Technologies, Inc. VT82C686 AC97 Audio Controller (rev 50) 00:09.0 USB Controller: NEC Corporation USB (rev 43) 00:09.1 USB Controller: NEC Corporation USB (rev 43) 00:09.2 USB Controller: NEC Corporation USB 2.0 (rev 04) 00:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10) 01:00.0 VGA compatible controller: Trident Microsystems CyberBlade/i1 (rev 6a) For stock kernel 42.0.3 i had 2 panics, one of them related to fs/bio.c: [root at spoolbox crash]# grep -R 2.6.9-42.0.3.EL 192.168.0.6-2006* 192.168.0.6-2006-10-10-05:39/log:EFLAGS: 00010082 (2.6.9-42.0.3.EL) 192.168.0.6-2006-10-12-10:21/log:EFLAGS: 00010206 (2.6.9-42.0.3.EL) [root at spoolbox crash]# tail -n 38 192.168.0.6-2006-10-12-10:21/log ------------[ cut here ]------------ kernel BUG at fs/bio.c:99! invalid operand: 0000 [#1] Modules linked in: vfat fat imm eeprom i2c_viapro nfsd exportfs nfs_acl lp netconsole netdump autofs4 via686a i2c_sensor i2c_isa i2c_dev i2c_core rfcomm l2cap lockd sunrpc ip_nat_ftp ip_conntrack_ftp ipt_MASQUERADE iptable_nat ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_multipath hci_usb bluetooth sd_mod usb_storage scsi_mod ohci_hcd ehci_hcd parport_pc parport snd_via82xx snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore 8139too mii dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod CPU: 0 EIP: 0060:[<c0171399>] Not tainted VLI EFLAGS: 00010206 (2.6.9-42.0.3.EL) EIP is at bio_destructor+0x19/0x3a eax: 00000008 ebx: c2356b00 ecx: cddf1f50 edx: c036d880 esi: 00001000 edi: c0170d59 ebp: 00000000 esp: c03fff24 ds: 007b es: 007b ss: 0068 Process usb-storage (pid: 2030, threadinfo=c03ff000 task=d5c9e130) Stack: c2356b00 c01715a8 c0170d8c c2356b00 c0172149 00001000 c2356b00 00000000 c03fff6c c0258da9 c8c4386c 00000000 00000000 00005000 00002000 c8c4386c d56de720 d57c6340 00000001 d840add1 00000001 d5509560 c8c4386c d57c6340 Call Trace: [<c01715a8>] bio_put+0x27/0x28 [<c0170d8c>] end_bio_bh_io_sync+0x33/0x37 [<c0172149>] bio_endio+0x4f/0x54 [<c0258da9>] __end_that_request_first+0xea/0x1ab [<d840add1>] scsi_end_request+0x1b/0x174 [scsi_mod] [<d840b268>] scsi_io_completion+0x20b/0x417 [scsi_mod] [<d840600c>] scsi_finish_command+0xad/0xb1 [scsi_mod] [<d8405f31>] scsi_softirq+0xba/0xc2 [scsi_mod] [<c0129a8d>] __do_softirq+0x35/0x79 [<c0109446>] do_softirq+0x46/0x4d ====================== [<c01089f7>] do_IRQ+0x2b3/0x2bf [<c031983c>] common_interrupt+0x18/0x20 [<c0316019>] __down_interruptible+0x13f/0x24a [<c0120049>] default_wake_function+0x0/0xc [<c0316137>] __down_failed_interruptible+0x7/0xc [<d83f4b13>] .text.lock.usb+0xf/0x78 [usb_storage] [<c0318d7e>] ret_from_fork+0x6/0x14 [<d83f3d24>] usb_stor_control_thread+0x0/0x417 [usb_storage] [<d83f3d24>] usb_stor_control_thread+0x0/0x417 [usb_storage] [<c01041dd>] kernel_thread_helper+0x5/0xb Code: 4d 1a 00 e9 cc c9 ff ff e8 cd 4d 1a 00 e9 26 ca ff ff 53 89 c3 8b 40 10 c1 e8 1c 89 c2 c1 e2 04 81 c2 00 d8 36 c0 83 f8 05 7e 08 <0f> 0b 63 00 f0 d8 32 c0 8b 43 30 8b 52 0c e8 90 ce fd ff 89 d8 [root at spoolbox crash]# tail -n 31 192.168.0.6-2006-10-10-05:39/log kernel BUG at kernel/panic.c:75! invalid operand: 0000 [#1] Modules linked in: nls_utf8 ppp_deflate zlib_deflate ppp_async crc_ccitt ppp_generic slhc vfat fat imm eeprom i2c_viapro nfsd exportfs nfs_acl lp netconsole netdump autofs4 via686a i2c_sensor i2c_isa i2c_dev i2c_core rfcomm l2cap lockd sunrpc ip_nat_ftp ip_conntrack_ftp ipt_MASQUERADE iptable_nat ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_multipath hci_usb bluetooth sd_mod usb_storage scsi_mod ohci_hcd ehci_hcd parport_pc parport snd_via82xx snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore 8139too mii dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod CPU: 0 EIP: 0060:[<c0123db2>] Not tainted VLI EFLAGS: 00010082 (2.6.9-42.0.3.EL) EIP is at panic+0x47/0x142 eax: 00000065 ebx: c2356b00 ecx: c032a703 edx: d75fbe40 esi: d75fbe8c edi: d75fbe94 ebp: 00000000 esp: d75fbe48 ds: 007b es: 007b ss: 0068 Process kswapd0 (pid: 41, threadinfo=d75fb000 task=d75fc050) Stack: c2356b00 c014b87e c0322158 c032c3b3 00000295 c2356b10 00000000 d75fbe8c c036e120 c0155ba7 d75fbe94 c2356a20 00000000 c015602f 0000000e ffffffff c2356b00 00000000 00000000 c1290400 c01557c6 c036c9b4 00000286 00000000 Call Trace: [<c014b87e>] find_get_pages+0xb6/0xf4 [<c0155ba7>] pagevec_lookup+0x17/0x1d [<c015602f>] invalidate_mapping_pages+0xb2/0xc5 [<c01557c6>] __pagevec_release+0x15/0x1d [<c016eceb>] remove_inode_buffers+0x12/0x112 [<c01899a8>] prune_icache+0x1a7/0x354 [<c0189b69>] shrink_icache_memory+0x14/0x2b [<c015630b>] shrink_slab+0xf7/0x14c [<c015794b>] balance_pgdat+0x1b3/0x2cb [<c0157b1c>] kswapd+0xb9/0xbb [<c0121853>] autoremove_wake_function+0x0/0x2d [<c0318d7e>] ret_from_fork+0x6/0x14 [<c0121853>] autoremove_wake_function+0x0/0x2d [<c0157a63>] kswapd+0x0/0xbb [<c01041dd>] kernel_thread_helper+0x5/0xb Code: 40 c0 e8 3b 5d 0c 00 68 60 7f 40 c0 68 03 a7 32 c0 e8 64 0b 00 00 83 c4 0c 83 3d 1c a8 42 c0 00 75 09 83 3d 18 a8 42 c0 00 74 08 <0f> 0b 4b 00 26 a7 32 c0 31 c0 e8 53 97 ff ff 31 d2 b9 60 7f 40 Finally, to reproduce iirc i simply have to perform two r/w commands concurrently, i.e an 'dd' command writing to that external usb disk and at the same time a second r/w operation operation to that disk, i.e. via nfs or even on console, after hours/secs the condition is triggered. I know usb external disks are used with scsi emulation from the point of view of kernel and they are not a really good approach for massive data storing (rather using scsi units mainly/directly) but i have almost a TB on that disks... By the way i use the workaround of not writing concurrently to that disk for a long time or for long chunks... heh, no more clues unless i rescue the old 4.3 and 22.0.2 raw hd/image for that old host serving as NAS. Don't know if all of this will be helping, anyway... Good luck! Jose.
El lun, 23-10-2006 a las 17:50 +0400, Kirill Korotaev escribi?:> J.J. Garcia, > > the bug you face looks exactly like the ours one. > I thought it is memory corruption since %eax is 8, while should be 0. > (BTW, can you run memtest to make sure your memory is really ok? > http://wiki.openvz.org/Hardware_testing ), > but the fact that it is always 8 in yours and our case makes me believe > it is something else... > > If I provide some debugging patch for you, will you be able to apply it to your > kernel, rebuild it and test the issue? > > Your help is very much appreciated. > > Thanks, > Kirill >Sure i'll do my best, if you provide me the patch i can check it on the current host, it's not a very critycall host at the network and i think the bug is relevant to stop it for a while, I've started by installing memtest86+ in the related host following the next steps, for your info: <...> ============================================================================ Package Arch Version Repository Size ============================================================================Installing: memtest86+ i386 1.26-2 base 53 k Transaction Summary ============================================================================Install 1 Package(s) Update 0 Package(s) Remove 0 Package(s) Total download size: 53 k Is this ok [y/N]: y Downloading Packages: (1/1): memtest86+-1.26-2. 100% |=========================| 53 kB 00:00 Running Transaction Test Finished Transaction Test Transaction Test Succeeded Running Transaction Installing: memtest86+ ######################### [1/1] Installed: memtest86+.i386 0:1.26-2 Complete! [root at fattybox ~]# rpm -ql memtest86+ /boot/memtest86+-1.26 /sbin/new-memtest-pkg /usr/sbin/memtest-setup /usr/share/doc/memtest86+-1.26 /usr/share/doc/memtest86+-1.26/README [root at fattybox ~]# rpm -qi memtest86+ Name : memtest86+ Relocations: (not relocatable) Version : 1.26 Vendor: CentOS Release : 2 Build Date: lun 21 feb 2005 20:35:44 CET Install Date: lun 23 oct 2006 16:25:57 CEST Build Host: bhrama.build.karan.org Group : System Environment/Base Source RPM: memtest86 +-1.26-2.src.rpm Size : 123633 License: GPL Signature : DSA/SHA1, s?b 26 feb 2005 21:59:06 CET, Key ID a53d0bab443e1821 Packager : Karanbir Singh <kbsingh at centos.org> URL : http://www.memtest.org Summary : Stand-alone memory tester for x86 and x86-64 computers Description : Memtest86+ is a thorough stand-alone memory test for x86 and x86-64 architecture computers. BIOS based memory tests are only a quick check and often miss many of the failures that are detected by Memtest86+. Run 'memtest-setup' to add to your GRUB or lilo boot menu. root at fattybox ~]# Proceding with the install on boot, [root at fattybox ~]# memtest-setup Setup complete. Lead to /etc/grub.conf in the following way, i'll use it to launch the tests by the way: title Memtest86+ (1.26) root (hd0,0) kernel /memtest86+-1.26 ro root=/dev/VolGroup00/LogVol00 ACPI=off vga=0x307 selinux=0 Since here, memtest is running using default config, feel free 2 tell me 2 change the default params when running if you are looking for something you need, i'll leave it running for 48 hours looking for something strange in memory. I've to note that this host has shared memm for the graphics, iow, there's no graphic card but embedded one on mobo, it's a DFI CM33T3-100 mobo (CM33-TL) with up2date bios according dfi with a intel celeron running. I can't assure kingstom memories... but 22.0.2 worked fine with this hardware previously for long time (months, and year of uptime with heavy loads)... We'll keep on touch, Jose.
El mar, 24-10-2006 a las 19:20 +0400, Kirill Korotaev escribi?:> J.J. Garcia, > > thanks a lot for the detailed answer and taking your time helping! >Morning Kirill, Finally i managed to solve the memory problem by replacing a 128MB PC133 module, same memory config (1x256+1x128 on that mobo) than previous, same environment then. Running memtest for almost 24 hours leads to no memory issues. Booted with 42.0.3 since few hours, sys up and running. [root at fattybox ~]# iostat Linux 2.6.9-42.0.3.EL (fattybox.stigmatedbrain.net) 25/10/06 cpu-med: %user %nice %sys %iowait %idle 2,02 78,23 18,66 0,20 0,90 Device: tps Blq_leid/s Blq_escr/s Blq_leid Blq_escr hda 2,91 33,04 28,87 1700018 1485314 hda1 0,01 0,02 0,00 1040 106 hda2 5,42 33,00 28,86 1697882 1485208 hdd 1,95 101,58 1,23 5226772 63472 hdd1 2,29 101,55 1,23 5225204 63472 dm-0 5,41 32,98 28,86 1697138 1484888 dm-1 0,00 0,01 0,01 360 320 sda 1,40 71,35 0,87 3671370 44896 sda1 1,83 71,34 0,87 3671106 44896 sdb 7,51 613,80 13,20 31583426 679032 sdb1 12,87 613,79 13,20 31583290 679032 sdc 0,00 0,02 0,00 786 168 sdc1 0,01 0,01 0,00 650 168 sdd 19,61 1020,24 1824,04 52497330 93857456 sdd1 244,30 1020,24 1824,04 52497194 93857456 sde 0,00 0,00 0,00 8 0> 1. do you use md devices in your system? >Not at the moment, no raid configuration on that host, only ide disks and usb2 external harddisks, [root at fattybox ~]# cat /proc/mdstat Personalities : unused devices: <none> [root at fattybox ~]# dmesg | grep SCSI parport0 (addr 0): SCSI adapter, IMG VP1 SCSI subsystem initialized scsi0 : SCSI emulation for USB Mass Storage devices Type: Direct-Access ANSI SCSI revision: 02 scsi1 : SCSI emulation for USB Mass Storage devices SCSI device sda: 39070080 512-byte hdwr sectors (20004 MB) SCSI device sda: 39070080 512-byte hdwr sectors (20004 MB) Type: Direct-Access ANSI SCSI revision: 02 SCSI device sdb: 586114704 512-byte hdwr sectors (300091 MB) SCSI device sdb: 586114704 512-byte hdwr sectors (300091 MB) scsi2 : SCSI emulation for USB Mass Storage devices Type: Direct-Access ANSI SCSI revision: 02 SCSI device sdc: 78140160 512-byte hdwr sectors (40008 MB) SCSI device sdc: 78140160 512-byte hdwr sectors (40008 MB) scsi3 : SCSI emulation for USB Mass Storage devices Type: Direct-Access ANSI SCSI revision: 02 SCSI device sdd: 490234752 512-byte hdwr sectors (251000 MB) SCSI device sdd: 490234752 512-byte hdwr sectors (251000 MB) Type: Direct-Access ANSI SCSI revision: 02 SCSI device sde: 196608 512-byte hdwr sectors (101 MB) SCSI device sde: drive cache: write back SCSI device sde: 196608 512-byte hdwr sectors (101 MB) SCSI device sde: drive cache: write back> 2. try applying diff-bio-debug-on-orig-rhel4 patch first: > # patch -p1 < diff-bio-debug-on-orig-rhel4 > it will print mode details in case bug happens again. > please note that it should not panic due to the bug, so you will > need to check dmesg whether bug was hit or not. > > 3. As additional check you can backout debug patch and apply 2nd patch diff-bio. > This is the only change in block I/O which I see from .22 kernel which can > influence somehow. So apply it as: > # patch -p1 -R < diff-bio-debug-on-orig-rhel4 > # patch -p1 < diff-bio > > Check whether bug is reproducable now or not.I've launched several I/O operations on usb disks to see if i can reproduce the bug, not "succeeded" by the moment but let me check it for several days. If i get again the panic, i'll patch the kernel and send you back the results, hope this help. Thanks a lot for the hints, Jose.> > Thanks, > Kirill >