Fourie Joubert
2010-Aug-18 10:14 UTC
[Lustre-discuss] More detail regarding soft lockup error
Hi Folks

Just reporting some more detail about the soft lockup error I have been getting:

I am running Lustre 1.8.1; the kernels are from the Lustre distro. I have three OSTs; they are all volumes on an IBM DS3400.

Following a crash, whenever I try to mount the volumes, I get:

Aug 18 12:14:11 wonkofs kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_mdt_16:5375]
Aug 18 12:14:11 wonkofs kernel: CPU 2:
Aug 18 12:14:11 wonkofs kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) loop(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lvfs(U) lnet(U) libcfs(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) cxgb3(U) mlx4_en(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) parport_pc(U) lp(U) parport(U) joydev(U) ib_qib(U) ib_mad(U) ib_core(U) cdc_ether(U) usbnet(U) i2c_i801(U) pcspkr(U) i2c_core(U) bnx2(U) sg(U) ide_cd(U) cdrom(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) qla2xxx(U) scsi_transport_fc(U) ata_piix(U) libata(
Aug 18 12:14:11 wonkofs kernel: ) shpchp(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Aug 18 12:14:11 wonkofs kernel: Pid: 5375, comm: ll_mdt_16 Tainted: G 2.6.18-128.1.14.el5_lustre.1.8.1 #1
Aug 18 12:14:11 wonkofs kernel: RIP: 0010:[<ffffffff88b6cf7b>] [<ffffffff88b6cf7b>] :ldiskfs:ldiskfs_get_group_desc+0x3b/0xb0
Aug 18 12:14:11 wonkofs kernel: RSP: 0018:ffff8104d24a9220 EFLAGS: 00000297
Aug 18 12:14:11 wonkofs kernel: RAX: 000000000000017e RBX: ffff8104d42f5800 RCX: 0000000000000d80
Aug 18 12:14:11 wonkofs kernel: RDX: 0000000000000000 RSI: 000000000000006b RDI: ffff8104d42f5800
Aug 18 12:14:11 wonkofs kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000006b
Aug 18 12:14:11 wonkofs kernel: R10: ffff8104d46a0000 R11: 0000000000000000 R12: ffff8104d24a939b
Aug 18 12:14:11 wonkofs kernel: R13: 0000000000000000 R14: ffff8104d7ae2310 R15: ffff8104d68acb30
Aug 18 12:14:11 wonkofs kernel: FS: 00002abdd2330230(0000) GS:ffff810111920cc0(0000) knlGS:0000000000000000
Aug 18 12:14:11 wonkofs kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 18 12:14:11 wonkofs kernel: CR2: 00000000162a2a5c CR3: 0000000000201000 CR4: 00000000000006e0
Aug 18 12:14:11 wonkofs kernel:
Aug 18 12:14:11 wonkofs kernel: Call Trace:
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88b7007f>] :ldiskfs:ldiskfs_find_reverse+0x2f/0xa0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88b78f26>] :ldiskfs:ldiskfs_new_inode_wantedi+0x46/0x80
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88b7986c>] :ldiskfs:ldiskfs_create+0xbc/0x140
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8003a7c1>] vfs_create+0xe6/0x158
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88c4ba42>] :mds:mds_obd_create+0x522/0xec0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff887e0278>] :libcfs:cfs_alloc+0x28/0x60
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88869298>] :obdclass:llog_alloc_handle+0x1f8/0x280
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88872f8d>] :obdclass:llog_lvfs_create+0xafd/0xf3e
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88929359>] :ptlrpc:ldlm_srv_pool_recalc+0x79/0x250
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8886be4a>] :obdclass:llog_cat_current_log+0x67a/0xfc0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88b7293a>] :ldiskfs:ldiskfs_get_blocks_handle+0x11a/0xd30
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8886df9d>] :obdclass:llog_cat_add_rec+0xbd/0x730
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88932eb9>] :ptlrpc:ptlrpc_prep_set+0x1e9/0x290
Aug 18 12:14:11 wonkofs kernel: [<ffffffff888741b8>] :obdclass:llog_obd_origin_add+0xa8/0x180
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88926a5f>] :ptlrpc:ldlm_process_inodebits_lock+0x39f/0x430
Aug 18 12:14:11 wonkofs kernel: [<ffffffff888749d6>] :obdclass:llog_add+0x296/0x340
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88929e31>] :ptlrpc:ldlm_pool_add+0x131/0x190
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88a7f748>] :lov:lov_llog_origin_add+0x748/0x7b0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff80019daf>] __getblk+0x1d/0x230
Aug 18 12:14:11 wonkofs kernel: [<ffffffff888749d6>] :obdclass:llog_add+0x296/0x340
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88b721ef>] :ldiskfs:__ldiskfs_get_inode_loc+0x15f/0x370
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88c0ce5a>] :mds:mds_llog_origin_add+0x4ba/0x500
Aug 18 12:14:11 wonkofs kernel: [<ffffffff80019daf>] __getblk+0x1d/0x230
Aug 18 12:14:11 wonkofs kernel: [<ffffffff888749ae>] :obdclass:llog_add+0x26e/0x340
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88a91c8d>] :lov:lov_checkmd+0x20d/0x290
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88c0c73d>] :mds:mds_llog_add_unlink+0x73d/0x9a0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88b72612>] :ldiskfs:ldiskfs_mark_inode_dirty+0x132/0x160
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88c0fdfd>] :mds:mds_log_op_unlink+0x53d/0x960
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8000cda2>] dnotify_parent+0x1c/0x6b
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88c40792>] :mds:mds_reint_unlink+0x1912/0x2780
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88815067>] :lnet:lnet_prep_send+0x67/0xb0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8893328c>] :ptlrpc:ptlrpc_set_destroy+0x32c/0x390
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8000d762>] dput+0x23/0x10a
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88c30c89>] :mds:mds_reint_rec+0x1d9/0x2b0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88c5b973>] :mds:mds_unlink_unpack+0x293/0x3b0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88941969>] :ptlrpc:lustre_pack_reply_flags+0x7f9/0x8e0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88c23a1a>] :mds:mds_reint+0x35a/0x420
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88941a79>] :ptlrpc:lustre_pack_reply+0x29/0xb0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8893e135>] :ptlrpc:lustre_msg_get_flags+0x35/0xf0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88c29a63>] :mds:mds_handle+0x24b3/0x4cb0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff800d739c>] cache_flusharray+0x2f/0xa3
Aug 18 12:14:11 wonkofs kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28
Aug 18 12:14:11 wonkofs kernel: [<ffffffff80088f32>] find_busiest_group+0x20d/0x621
Aug 18 12:14:11 wonkofs kernel: [<ffffffff88943a15>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff80089d89>] enqueue_task+0x41/0x56
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8894872d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8894ae67>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
Aug 18 12:14:11 wonkofs kernel: [<ffffffff80088819>] __wake_up_common+0x3e/0x68
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8894e908>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8008a3ef>] default_wake_function+0x0/0xe
Aug 18 12:14:11 wonkofs kernel: [<ffffffff800b48dd>] audit_syscall_exit+0x327/0x342
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8894d6f0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Aug 18 12:14:11 wonkofs kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Aug 18 12:14:11 wonkofs kernel:
Aug 18 12:14:21 wonkofs kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_mdt_16:5375]

The same soft lockup, with an identical call trace, is logged again every 10 seconds (12:14:21 and onwards).

Any ideas would be appreciated.

Kindest regards!

Fourie

--
--------------
Prof Fourie Joubert
Associate Professor
Bioinformatics and Computational Biology Unit
Department of Biochemistry
University of Pretoria
fourie.joubert at up.ac.za
http://www.bi.up.ac.za
Tel. +27-12-420-5802
Fax. +27-12-420-5800

-------------------------------------------------------------------------
This message and attachments are subject to a disclaimer. Please refer to www.it.up.ac.za/documentation/governance/disclaimer/ for full details.
Andreas Dilger
2010-Aug-18 14:27 UTC
[Lustre-discuss] More detail regarding soft lockup error
On 2010-08-18, at 4:14, Fourie Joubert <fourie.joubert at up.ac.za> wrote:
> Just reporting some more detail about the soft lockup error I have been
> getting:
>
> I am running Lustre 1.8.1, kernels are from the Lustre distro.

Firstly, there is a known corruption bug in 1.8.1; you should at minimum upgrade to 1.8.1.1, but it may be that this problem is fixed in a newer release.

> Aug 18 12:14:11 wonkofs kernel: [<ffffffff88b7007f>] :ldiskfs:ldiskfs_get_group_desc+0x3b/0xb0
> Aug 18 12:14:11 wonkofs kernel: [<ffffffff88b7007f>] :ldiskfs:ldiskfs_find_reverse+0x2f/0xa0
> Aug 18 12:14:11 wonkofs kernel: [<ffffffff88b78f26>] :ldiskfs:ldiskfs_new_inode_wantedi+0x46/0x80

This is getting stuck looking for a free inode. Is there free space in the MDT filesystem? Have you run a full e2fsck?

I do recall some changes in this code a while ago, but I don't recall what version it was in offhand.

Cheers, Andreas
Fourie Joubert
2010-Aug-18 18:00 UTC
[Lustre-discuss] More detail regarding soft lockup error
Hi

Many thanks for the reply!

The MDT is only 5% used. OSTs are all around 30%. I have run the full series of e2fsck passes.

I have been running 1.8.1 as it was built with weak dependencies to Infiniband, and I needed OFED 1.5 for my hardware... But I have recently been told by Marco Gomes (based on a recent forum discussion?) that 1.8.3 seems to work with OFED 1.5.2-rc2. Will try to upgrade and see...

Thanks and kindest regards!

Fourie

On 18/08/2010 16:27, Andreas Dilger wrote:
> Firstly, there is a known corruption bug in 1.8.1; you should at minimum
> upgrade to 1.8.1.1, but it may be that this problem is fixed in a newer
> release.
>
> This is getting stuck looking for a free inode. Is there free space in
> the MDT filesystem? Have you run a full e2fsck?
>
> I do recall some changes in this code a while ago, but I don't recall
> what version it was in offhand.
>
> Cheers, Andreas
Fourie Joubert
2010-Aug-19 06:41 UTC
[Lustre-discuss] More detail regarding soft lockup error
Hi

Many thanks for the suggestion - I seem to have indeed run out of inodes! We have a massive amount of very small files :-(

Does anyone know if it is possible to increase the number of inodes, or do I need to redo the MDT?

Kindest regards!

Fourie

Ben Evans wrote:
> I've found MDT utilization is best measured using "df -i" rather than
> "df -k", since you'll run out of inodes before you run out of space. No
> file data is stored on the MDT and the metadata is small.
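[Editor's note] Ben's "df -i" advice can be sketched as a quick two-command check. The MDT mount point differs per site ("/" is used below only so the sketch runs anywhere; a real MDS would use the mounted MDT path, e.g. an assumed /mnt/mdt):

```shell
# Compare block usage against inode usage. An MDT can look nearly empty
# by "df -k" (the "5% used" figure above) while being completely out of
# inodes, which is what "df -i" reveals.
df -k /    # block usage
df -i /    # inode usage -- the number that actually ran out here
```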
Andreas Dilger
2010-Aug-19 16:09 UTC
[Lustre-discuss] More detail regarding soft lockup error
On 2010-08-19, at 00:41, Fourie Joubert wrote:
> Many thanks for the suggestion - I seem to have indeed run out of inodes!

That is what I had asked previously, but I wasn't very clear.

> We have a massive amount of very small files :-(
>
> Does anyone know if it is possible to increase the number of inodes, or
> do I need to redo the MDT?

If you increase the size of the MDT (via resize2fs) it will increase the number of inodes as well.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Brian J. Murrell
2010-Aug-19 16:49 UTC
[Lustre-discuss] More detail regarding soft lockup error
On Thu, 2010-08-19 at 10:09 -0600, Andreas Dilger wrote:
> If you increase the size of the MDT (via resize2fs) it will increase the
> number of inodes as well.

Andreas: what is [y]our confidence level with resize2fs and our MDT? Given that I don't think we regularly (if at all) test this in our QA cycles (although I wish we would), I personally would be a lot more comfortable with a backup first. What are your thoughts? Unnecessary?

b.
Kevin Van Maren
2010-Aug-19 16:58 UTC
[Lustre-discuss] More detail regarding soft lockup error
Andreas _always_ recommends a backup first.

Kevin

Brian J. Murrell wrote:
> On Thu, 2010-08-19 at 10:09 -0600, Andreas Dilger wrote:
>> If you increase the size of the MDT (via resize2fs) it will increase
>> the number of inodes as well.
>
> Andreas: what is [y]our confidence level with resize2fs and our MDT?
> Given that I don't think we regularly (if at all) test this in our QA
> cycles (although I wish we would) I personally would be a lot more
> comfortable with a backup first. What are your thoughts? Unnecessary?
>
> b.
Ms. Megan Larko
2010-Aug-19 18:16 UTC
[Lustre-discuss] More detail regarding soft lockup error
I will add emphasis here. Back up the MDT before doing anything at all. The MDT backup procedure is short and documented in the Lustre Manual.

Megan (MDT back-up saved my bacon) Larko

---------------------------------------------------------
Kevin Van Maren said:
> Andreas _always_ recommends a backup first.
>
> Kevin
Andreas Dilger
2010-Aug-19 19:28 UTC
[Lustre-discuss] More detail regarding soft lockup error
On 2010-08-19, at 10:49, Brian J. Murrell wrote:
> Andreas: what is [y]our confidence level with resize2fs and our MDT?
> Given that I don't think we regularly (if at all) test this in our QA
> cycles (although I wish we would) I personally would be a lot more
> comfortable with a backup first. What are your thoughts? Unnecessary?

Always have a backup of the MDS, even if you are NOT doing an inherently risky process like potentially rewriting all of the metadata in the filesystem...

I keep two full "dd" copies of my MDS, alternating days, given that the space required is so small. Even if there is a large MDS with short-stroked SAS drives or SSDs in RAID-1+0, keeping a handful of slow 1.5TB SATA drives attached just for backups makes a lot of sense, and costs a few hundred dollars. They don't need to be dual-ported (or even more than RAID-0 or LVM concatenated volumes). Because the "dd" backup and restore is entirely linear IO, the SATA drives will give very good performance.

That said, any ext* filesystem formatted in the past 5 years can normally do a resize of 1024x without actually having to scan/rewrite the filesystem metadata. I can't say that I've had to do any MDT resizing, but I've resized my OSTs a bunch of times w/o ill effects. That isn't to say that Oracle tests this; just my personal observation from my home setup.

Cheers, Andreas
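[Editor's note] A minimal sketch of the device-level "dd" backup Andreas describes, using scratch files as stand-ins for the MDT device and backup disk so the sketch is runnable anywhere. All paths are assumptions; on a real MDS the source would be the MDT block device (e.g. /dev/mdt1_vol/mdt), read while the MDS is stopped or the volume is snapshotted:

```shell
# Stand-ins for the MDT block device and the backup drive (assumed
# names; a real backup reads the actual unmounted MDT device).
MDT_DEV=$(mktemp)
BACKUP=$(mktemp)

# Fake a small 4 MB "MDT" so the sketch runs end to end.
dd if=/dev/zero of="$MDT_DEV" bs=1M count=4 2>/dev/null

# The backup itself: a linear device-level copy, which is why even slow
# SATA backup drives perform well for this.
dd if="$MDT_DEV" of="$BACKUP" bs=1M conv=fsync 2>/dev/null

# Verify the copy is bit-identical before trusting it.
cmp -s "$MDT_DEV" "$BACKUP" && echo "backup verified"
rm -f "$MDT_DEV" "$BACKUP"
```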
Nirmal Seenu
2010-Aug-19 20:21 UTC
[Lustre-discuss] More detail regarding soft lockup error
We recently resized our MDT on two different Lustre installations under production use and everything worked out smoothly. I did a backup with dd before resizing, but we didn't need to use the backup image after all:

[root at mdt1 ~]# e2fsck -f /dev/mdt1_vol/mdt
e2fsck 1.41.6.sun1 (30-May-2009)
lqcdproj-MDT0000: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lqcdproj-MDT0000: 10485760/10485760 files (0.0% non-contiguous), 1706638/10485760 blocks

[root at mdt1 ~]# lvdisplay /dev/mdt1_vol/mdt
  --- Logical volume ---
  LV Name                /dev/mdt1_vol/mdt
  VG Name                mdt1_vol
  LV Size                40.00 GB
  Current LE             10240
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

[root at mdt1 ~]# lvextend --size +40G -n /dev/mdt1_vol/mdt
  Extending logical volume mdt to 80.00 GB
  Logical volume mdt successfully resized

[root at mdt1 ~]# lvdisplay /dev/mdt1_vol/mdt
  --- Logical volume ---
  LV Name                /dev/mdt1_vol/mdt
  VG Name                mdt1_vol
  LV Size                80.00 GB
  Current LE             20480
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

[root at mdt1 ~]# e2fsck -f /dev/mdt1_vol/mdt
[root at mdt1 ~]# resize2fs /dev/mdt1_vol/mdt
resize2fs 1.41.6.sun1 (30-May-2009)
Resizing the filesystem on /dev/mdt1_vol/mdt to 20971520 (4k) blocks.
The filesystem on /dev/mdt1_vol/mdt is now 20971520 blocks long.
[root at mdt1 ~]# e2fsck -f /dev/mdt1_vol/mdt

Nirmal