"Budsjö, Martin"
2009-Nov-09 13:37 UTC
[Lustre-discuss] Has anybody seen: ldiskfs_get_inode_block: bad inode number: 1
Hi,

After our upgrade to 1.6.7.2 (from 1.4.12) we started to get "ldiskfs_get_inode_block: bad inode number: 1" errors. This causes the ldiskfs filesystem to remount read-only, making the whole Lustre filesystem read-only.

We took the filesystems offline and ran fsck on them. Some minor errors were found and fixed, but the error persists. Any clue?

Regards
Martin Budsjö

Example of the error:

Nov 4 03:06:32 hlc305 kernel: LDISKFS-fs error (device dm-28): ldiskfs_get_inode_block: bad inode number: 1
Nov 4 03:06:32 hlc305 kernel: Remounting filesystem read-only
Nov 4 03:06:32 hlc305 kernel: LDISKFS-fs error (device dm-28): ldiskfs_get_inode_block: bad inode number: 1
Nov 4 03:06:32 hlc305 kernel: LustreError: 1316:0:(fsfilt-ldiskfs.c:280:fsfilt_ldiskfs_start()) error starting handle for op 4 (35 credits): rc -30
Nov 4 03:06:32 hlc305 kernel: LustreError: 1316:0:(fsfilt-ldiskfs.c:280:fsfilt_ldiskfs_start()) Skipped 52 previous similar messages
Nov 4 03:06:32 hlc305 kernel: LustreError: 1316:0:(mds_open.c:769:mds_finish_open()) mds_create_objects: rc = -30
Nov 4 03:06:32 hlc305 kernel: LustreError: 1316:0:(mds_open.c:769:mds_finish_open()) Skipped 1 previous similar message
Nov 4 03:06:32 hlc305 kernel: LustreError: 1316:0:(mds_reint.c:154:mds_finish_transno()) fsfilt_start: -30
Nov 4 03:06:32 hlc305 kernel: LustreError: 1316:0:(mds_reint.c:154:mds_finish_transno()) Skipped 52 previous similar messages
Nov 4 03:06:32 hlc305 kernel: LDISKFS-fs error (device dm-28): ldiskfs_get_inode_block: bad inode number: 1
Nov 4 03:06:32 hlc305 kernel: LDISKFS-fs error (device dm-28): ldiskfs_get_inode_block: bad inode number: 1
Nov 4 03:06:32 hlc305 kernel: LustreError: 1189:0:(mds_open.c:769:mds_finish_open()) mds_create_objects: rc = -30
Nov 4 03:07:14 hlc305 kernel: Lustre: Failing over spehome-MDT0000
Nov 4 03:07:14 hlc305 kernel: Lustre: Skipped 12 previous similar messages
Nov 4 03:07:14 hlc305 kernel: Lustre: *** setting obd spehome-MDT0000 device 'dm-28' read-only ***
Nov 4 03:07:14 hlc305 kernel: Turning device dm-28 (0xfd0001c) read-only
Nov 4 03:07:15 hlc305 kernel: LustreError: 1243:0:(handler.c:1601:mds_handle()) operation 101 on unconnected MDS from 12345-10.0.0.86@tcp
Nov 4 03:07:15 hlc305 kernel: LustreError: 1243:0:(handler.c:1601:mds_handle()) Skipped 74 previous similar messages
Nov 4 03:07:15 hlc305 kernel: LustreError: 1243:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-107) req@00000100cabd0800 x30168375/t0 o101-><?>@<?>: 0/0 lens 440/0 e 0 to 0 dl 1257300535 ref 1 fl Interpret:/0/0 rc -107/0
Nov 4 03:07:15 hlc305 kernel: LustreError: 1243:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 74 previous similar messages
Nov 4 03:07:15 hlc305 kernel: LustreError: 137-5: UUID 'mds5_UUID' is not available for connect (stopping)
Nov 4 03:07:15 hlc305 kernel: LustreError: 17131:0:(llog_obd.c:380:llog_obd_origin_cleanup()) failure destroying log during cleanup: -30
Nov 4 03:07:15 hlc305 kernel: LustreError: 17131:0:(llog_obd.c:380:llog_obd_origin_cleanup()) Skipped 6 previous similar messages
Nov 4 03:07:15 hlc305 kernel: LustreError: 17131:0:(fsfilt-ldiskfs.c:1236:fsfilt_ldiskfs_write_record()) can't start transaction for 34 blocks (8192 bytes)
Nov 4 03:07:15 hlc305 kernel: LustreError: 17131:0:(fsfilt-ldiskfs.c:1236:fsfilt_ldiskfs_write_record()) Skipped 6 previous similar messages
Nov 4 03:07:15 hlc305 kernel: LustreError: 17131:0:(llog_lvfs.c:116:llog_lvfs_write_blob()) error writing log record: rc -30
Nov 4 03:07:15 hlc305 kernel: LustreError: 17131:0:(llog_lvfs.c:116:llog_lvfs_write_blob()) Skipped 6 previous similar messages
Nov 4 03:07:15 hlc305 kernel: LustreError: 17131:0:(llog.c:135:llog_cancel_rec()) Failure re-writing header -30
Nov 4 03:07:15 hlc305 kernel: LustreError: 17131:0:(llog.c:135:llog_cancel_rec()) Skipped 6 previous similar messages
Nov 4 03:07:15 hlc305 kernel: LustreError: 17131:0:(handler.c:1963:mds_update_server_data()) error writing MDS server data: rc = -30
Nov 4 03:07:15 hlc305 kernel: Lustre: spehome-MDT0000: shutting down for failover; client state will be preserved.
Nov 4 03:07:15 hlc305 kernel: Lustre: MDT spehome-MDT0000 has stopped.
Nov 4 03:07:15 hlc305 kernel: VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
Nov 4 03:07:15 hlc305 kernel: Removing read-only on unknown block (0xfd0001c)
Nov 4 03:07:15 hlc305 kernel: Lustre: server umount spehome-MDT0000 complete
Nov 4 03:07:16 hlc305 kernel: kjournald starting. Commit interval 5 seconds
Nov 4 03:07:16 hlc305 kernel: LDISKFS-fs warning: mounting fs with errors, running e2fsck is recommended
Nov 4 03:07:16 hlc305 kernel: LDISKFS FS on dm-28, internal journal
Nov 4 03:07:16 hlc305 kernel: LDISKFS-fs: recovery complete.
Nov 4 03:07:16 hlc305 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Nov 4 03:07:16 hlc305 kernel: kjournald starting. Commit interval 5 seconds
Nov 4 03:07:16 hlc305 kernel: LDISKFS-fs warning: mounting fs with errors, running e2fsck is recommended
Nov 4 03:07:16 hlc305 kernel: LDISKFS FS on dm-28, internal journal
Nov 4 03:07:16 hlc305 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Nov 4 03:07:16 hlc305 kernel: Lustre: Enabling user_xattr
Nov 4 03:07:16 hlc305 kernel: Lustre: 17137:0:(mds_fs.c:511:mds_init_server_data()) RECOVERY: service spehome-MDT0000, 166 recoverable clients, last_transno 7646800754
Nov 4 03:07:16 hlc305 kernel: Lustre: 636:0:(mds_lov.c:1075:mds_notify()) MDS spehome-MDT0000: in recovery, not resetting orphans on ost1spehome_UUID
Nov 4 03:07:16 hlc305 kernel: Lustre: MDT spehome-MDT0000 now serving mds5_UUID (spehome-MDT0000/2af526ca-85d4-996b-4773-4898812d6a31), but will be in recovery for at least 5:00, or until 166 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/spehome-MDT0000/recovery_status.
Nov 4 03:07:16 hlc305 kernel: Lustre: Server spehome-MDT0000 on device /dev/vg_mds/spehome has started
Nov 4 03:07:16 hlc305 kernel: Lustre: spehome-MDT0000: temporarily refusing client connection from 10.0.0.69@tcp
Nov 4 03:07:16 hlc305 kernel: Lustre: 1247:0:(ldlm_lib.c:1240:check_and_start_recovery_timer()) spehome-MDT0000: starting recovery timer
Nov 4 03:07:16 hlc305 kernel: Lustre: 1249:0:(ldlm_lib.c:1591:target_queue_last_replay_reply()) spehome-MDT0000: 165 recoverable clients remain
Nov 4 03:07:16 hlc305 kernel: Lustre: 1249:0:(ldlm_lib.c:1591:target_queue_last_replay_reply()) Skipped 7 previous similar messages
Nov 4 03:07:16 hlc305 kernel: Lustre: 1284:0:(mds_open.c:841:mds_open_by_fid()) Orphan d5839c:772a0a47 found and opened in PENDING directory
Nov 4 03:07:16 hlc305 kernel: Lustre: 1284:0:(mds_open.c:841:mds_open_by_fid()) Skipped 1 previous similar message

--------------------------------------------------------------------------
Confidentiality Notice: This message is private and may contain confidential and proprietary information. If you have received this message in error, please notify us and remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this message is not permitted and may be unlawful.
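For reference, the offline check described above would look roughly like the following on the MDS. This is only a sketch: the device path /dev/vg_mds/spehome is taken from the log above, the mount point is hypothetical, and the ldiskfs-aware e2fsprogs shipped for Lustre should generally be preferred over the distro version.

# Sketch only: stop the MDT and fsck its backing device while it is unmounted.
umount /mnt/spehome-mdt            # hypothetical mount point; unmounting stops the MDS service in 1.6
e2fsck -fn /dev/vg_mds/spehome     # read-only pass first, to see what would be changed
e2fsck -fy /dev/vg_mds/spehome     # then let it repair what it found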
Andreas Dilger
2009-Nov-09 20:02 UTC
[Lustre-discuss] Has anybody seen: ldiskfs_get_inode_block: bad inode number: 1
On 2009-11-09, at 06:37, Budsjö, Martin wrote:
> After our upgrade to 1.6.7.2 (from 1.4.12) we started to get
> "ldiskfs_get_inode_block: bad inode number: 1" errors. This causes
> the ldiskfs filesystem to remount read-only, making the whole Lustre
> filesystem read-only.
>
> We took the filesystems offline and ran fsck on them. Some minor
> errors were found and fixed, but the error persists. Any clue?

That inode number is not valid for use, so either there is a directory entry with that value (which seems unlikely to exist after an e2fsck run), or some client is incorrectly requesting that inode number (possibly cached after having read it from the MDS previously?). It will probably go away if you reboot the client(s) that have cached this inode number, but it isn't clear from the error messages below which client that is. You might be able to find that in the Lustre kernel debug logs.
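Two ways to narrow that down on the MDS, sketched under the assumption that the MDT backing device is /dev/vg_mds/spehome as shown in the logs in this thread; adjust paths and options to your setup:

# (a) Look for a stray directory entry that still references inode 1.
#     Run against the unmounted MDT device; -c opens it read-only.
debugfs -c -R 'ncheck 1' /dev/vg_mds/spehome

# (b) After the error fires again, dump the Lustre kernel debug buffer on the MDS
#     and inspect the entries around the failed open for the client NID.
lctl dk /tmp/lustre-debug.log
grep -n -e mds_finish_open -e fsfilt_ldiskfs_start /tmp/lustre-debug.log

If ncheck turns up nothing, the stale reference is most likely cached on a client, which matches the suggestion above to reboot the client(s) holding it.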
> Example of the error:
>
> Nov 4 03:06:32 hlc305 kernel: LDISKFS-fs error (device dm-28): ldiskfs_get_inode_block: bad inode number: 1
> Nov 4 03:06:32 hlc305 kernel: Remounting filesystem read-only
> [...]
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.