By the way, we have also tried copying the MDT device with dd and mounting the
replica, but the problem still exists. Besides, the hardware monitor has not
reported any errors. This looks much more like an ldiskfs error than a hardware
error.
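
To see whether the same inodes are reported again after each e2fsck pass, the
kernel messages can be summarised with a short script. The following is only a
minimal sketch: it assumes each syslog message is on a single line in the
format quoted below, and the function name count_bad_inodes and the sample
list are made up for illustration.

```python
import re
from collections import Counter

# Assumed message format (taken from the log excerpt quoted in this mail):
#   ... kernel: LDISKFS-fs error (device cciss!c0d1): ldiskfs_ext_check_inode:
#   bad header/extent in inode #50736178: invalid magic - ...
PATTERN = re.compile(
    r"LDISKFS-fs error \(device (?P<device>[^)]+)\):\s*"
    r"(?P<function>\w+):.*?inode #(?P<inode>\d+)"
)


def count_bad_inodes(lines):
    """Count how often each (device, inode) pair appears in LDISKFS-fs errors."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("device"), int(match.group("inode")))] += 1
    return counts


# Hypothetical input: the two messages from the report, one message per line.
sample = [
    "Oct 8 20:16:44 mainmds kernel: LDISKFS-fs error (device cciss!c0d1): "
    "ldiskfs_ext_check_inode: bad header/extent in inode #50736178: "
    "invalid magic - magic 0, entries 0, max 0(0), depth 0(0)",
    "Oct 8 20:16:44 mainmds kernel: Aborting journal on device cciss!c0d1-8.",
]

for (device, inode), hits in count_bad_inodes(sample).items():
    print(device, inode, hits)
```

If the same inode numbers keep coming back after every e2fsck run, those
inodes could then be examined individually, for example with the debugfs
"stat <inode>" command on the unmounted MDT device.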
Lu
On 2012-10-9, at 12:04, wanglu wrote:
> Dear all,
> Two of our MDSs have recently been going read-only repeatedly, ever since
> we ran e2fsck once on Lustre 1.8.5. After the MDT has been mounted for a
> while, the kernel reports errors like:
> Oct 8 20:16:44 mainmds kernel: LDISKFS-fs error (device cciss!c0d1):
ldiskfs_ext_check_inode: bad header/extent in inode #50736178: invalid magic -
magic 0, entries 0, max 0(0), depth 0(0)
> Oct 8 20:16:44 mainmds kernel: Aborting journal on device cciss!c0d1-8.
> The MDS then becomes read-only.
> This problem has made about 1 PB of data (about 100 million files)
> inaccessible. We believe some structure in the MDT's local file system is
> corrupted, so we have tried to fix it with e2fsck, following the procedure in
> the Lustre manual. However, we always end up in the same loop:
> 1. Run e2fsck, which may or may not fix some errors.
> 2. Mount the MDT; after some client operations it goes read-only, and
> the whole system becomes unusable.
> 3. Run e2fsck again.
>
> We have tried three different Lustre versions (1.8.5, 1.8.6, and 1.8.8-wc)
> with their corresponding e2fsprogs, but the problem still exists.
> Currently, we can only use Lustre with all clients mounted read-only
> while we try to copy the whole file system. However, it takes a long time
> to generate the directory structure and file list for 100 million files.
>
> Can anyone give us some suggestions? Thank you very much!
>
> Lu Wang
> CC-IHEP
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss