Dear all,

More than 100 files have recently been damaged on our OSSs without any errors being logged. Each damaged file has the same size as its original copy in our backup system, but the checksum differs. For example:

# ll run_0008126_All_file015_SFO-1.raw.353645
-rw-r--r-- 1 chyd u07 2108082156 Mar 31 10:07 run_0008126_All_file015_SFO-1.raw.353645
# ll demaged
-rw-r--r-- 1 root root 2108082156 Mar 31 11:19 demaged
# cmp run_0008126_All_file015_SFO-1.raw.353645 demaged
run_0008126_All_file015_SFO-1.raw.353645 demaged differ: byte 16777217, line 118663
# adler32 run_0008126_All_file015_SFO-1.raw.353645
adler32(run_0008126_All_file015_SFO-1.raw.353645) = 3653083401, 0xd9bda109
# adler32 demaged
adler32(demaged) = 195426776, 0xba5f9d8

PS:
1. The modification times of the damaged files are the same as the times they were copied to Lustre.
2. There are no abnormal messages in the OSS logs.
3. All the OSSs and MDSs are working well.

Has anyone met this problem? Any ideas?

Best Regards
Lu Wang
--------------------------------------------------------------
Computing Center IHEP
Email: Lu.Wang at ihep.ac.cn
--------------------------------------------------------------
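For readers who want to repeat this kind of comparison without the cmp and adler32 tools, here is a minimal Python sketch. It assumes the adler32 command in the post implements the standard Adler-32 algorithm (the same one as zlib.adler32); the function names adler32_of and first_difference are made up for this illustration. Note that cmp counts bytes from 1, so the reported "byte 16777217" is offset 16777216, which is exactly 2^24 (16 MiB) into the file.

```python
import zlib

CHUNK = 1 << 20  # read 1 MiB at a time so multi-GB files never sit in memory


def adler32_of(path):
    """Return the Adler-32 checksum of a file as an unsigned 32-bit integer."""
    checksum = 1  # Adler-32 starts from 1 by definition
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xFFFFFFFF


def first_difference(path_a, path_b):
    """Return the 0-based offset of the first differing byte, or None if equal.

    cmp(1) reports 1-based positions, so cmp's "byte N" is offset N - 1 here.
    If one file is a prefix of the other, the shorter file's length is returned.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                return offset + min(len(a), len(b))
            if not a:  # both files ended at the same point
                return None
            offset += len(a)
```

Calling first_difference(backup_copy, lustre_copy) on many damaged files and checking whether the offsets share an alignment is one way to judge whether the damage follows a block or stripe boundary, though the thread itself does not state the stripe size in use.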
Brian J. Murrell
2009-Mar-31 15:25 UTC
[Lustre-discuss] File Content change without Error log
On Tue, 2009-03-31 at 12:15 +0800, Lu Wang wrote:
> Dear all,
> More than 100 files have recently been damaged on our OSSs without any errors being logged. Each damaged file has the same size as its original copy in our backup system, but the checksum differs. For example:
> # ll run_0008126_All_file015_SFO-1.raw.353645
> -rw-r--r-- 1 chyd u07 2108082156 Mar 31 10:07 run_0008126_All_file015_SFO-1.raw.353645
> # ll demaged
> -rw-r--r-- 1 root root 2108082156 Mar 31 11:19 demaged

I'm assuming run_0008126_All_file015_SFO-1.raw.353645 is from your backup and demaged is the "corrupt" file, is that correct? I will base my statements on that...

> # cmp run_0008126_All_file015_SFO-1.raw.353645 demaged
> run_0008126_All_file015_SFO-1.raw.353645 demaged differ: byte 16777217, line 118663
>
> # adler32 run_0008126_All_file015_SFO-1.raw.353645
> adler32(run_0008126_All_file015_SFO-1.raw.353645) = 3653083401, 0xd9bda109
> # adler32 demaged
> adler32(demaged) = 195426776, 0xba5f9d8
> PS:
> 1. The modification times of the damaged files are the same as the times they were copied to Lustre.

Why are the modification times of run_0008126_All_file015_SFO-1.raw.353645 and demaged different? Could that difference, and the relative newness of run_0008126_All_file015_SFO-1.raw.353645, explain what happened (i.e. it was written to, legitimately)?

> 2. There are no abnormal messages in the OSS logs.

There wouldn't be in normal situations, such as the file being written to after the backup was made. The modification times give no assurance that that was not the case, as "demaged" was written after run_0008126_All_file015_SFO-1.raw.353645.

Also, silent disk corruption (i.e. in the hardware) could be a cause, as could any kind of silent failure below the Lustre stack.

Also, with regard to the backup file that you are comparing to, is it truly the actual file on the backup medium that you are using in the comparison, or is it a copy (i.e. restored to a disk from the backup medium)? If it's a copy of the backup file, how do you know that the copy from the backup is not actually corrupt and that the copy on disk is in fact the true copy? Or how do you know that the copy that's on the backup medium is not corrupted (i.e. faulty backup medium)? What's your point of reference that assures that (the copy from) the backup is the true image and not damaged?

Just some things to consider.

b.
Yes, you are right.

The problem was caused by a misconfiguration of one disk array: two partitions of this array were mapped to the same LUN. That is to say, when I created OST1 on /dev/sda and OST2 on /dev/sdb, the two OSTs were actually written to the same disk partition on the array. (Strangely, there were no errors when I created two OSTs on the "same" partition.) When the content grew to nearly 40% of each OST, some of the files were damaged because the disk space was being used twice.

I found the problem because all the damaged files are on the same OST. After e2fsck, I lost one OST, and the other OST became double its size. There are a lot of "red" files in the directory "O" when I mount the OST as ldiskfs.

I used
lfs getstripe --obd ****_UUID /dir
to generate the list of damaged files.

Is it possible to get back the lost OST using the "red" files?
I am not sure I have explained this clearly...

------------------
Lu Wang
2009-04-01

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
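One way to look for this kind of double mapping before formatting anything is to compare the identifiers that the block devices report. The sketch below is only an illustration under stated assumptions: it assumes the kernel exposes a per-device wwid attribute under /sys/block/<dev>/device/ (older kernels may not), and the function names read_wwids and shared_identifiers are invented here. Whether the array in this thread would have reported identical identifiers for /dev/sda and /dev/sdb is not known. On multipath setups the same identifier legitimately appears on several devices, so a match is a prompt to investigate, not proof of a misconfiguration.

```python
import glob
import os
from collections import defaultdict


def read_wwids(sysfs_glob="/sys/block/sd*/device/wwid"):
    """Map block device name -> identifier string read from sysfs.

    ASSUMPTION: the running kernel exposes a `wwid` attribute at this path.
    If it does not, the glob matches nothing and an empty dict is returned.
    """
    ids = {}
    for path in glob.glob(sysfs_glob):
        # path looks like <...>/<dev>/device/wwid; pick out <dev>
        dev = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as f:
                ids[dev] = f.read().strip()
        except OSError:
            continue  # attribute unreadable; skip this device
    return ids


def shared_identifiers(ids):
    """Return {identifier: [devices...]} for identifiers seen on 2+ devices."""
    groups = defaultdict(list)
    for dev, ident in ids.items():
        if ident:  # ignore devices that report an empty identifier
            groups[ident].append(dev)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}
```

Running shared_identifiers(read_wwids()) on an OSS and finding two OST devices in the same group would be a reason to stop and check the array configuration before running mkfs.lustre.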
Brian J. Murrell
2009-Mar-31 16:57 UTC
[Lustre-discuss] File Content change without Error log
On Wed, 2009-04-01 at 00:40 +0800, Lu Wang wrote:
> Yes, you are right.
> The problem was caused by a misconfiguration of one disk array: two partitions of this array were mapped to the same LUN.

Hrm. That sounds rather bad.

> That is to say, when I created OST1 on /dev/sda and OST2 on /dev/sdb, the two OSTs were actually written to the same disk partition on the array.

Ouch.

> After e2fsck, I lost one OST, and the other OST became double its size.

I don't know that I would trust such an OST even after an e2fsck. The structure of the filesystem may be repaired but the contents of files are not. Of course, I suppose recovering whatever you can from one of the two OSTs is better than recovering from neither, but I would be very suspicious of the data coming from it.

> There are a lot of "red" files in the directory "O" when I mount the OST as ldiskfs.

I'm not sure what a "red" file is.

> I used
> lfs getstripe --obd ****_UUID /dir
> to generate the list of damaged files.
> Is it possible to get back the lost OST using the "red" files?

If I'm understanding what happened, I'd say you are rather lucky to get any data from either of the OSTs and that recovering data from both OSTs is rather unlikely.

b.
Yes, I am copying some files from our backup storage.

# pwd
/lustre/ost1/O/0/d0
[root at boss10 d0]# ll
total 58931924
-rwSrwSrw- 1 root root 0 Mar 4 15:33 10016
-rwSrwSrw- 1 root root 0 Mar 4 15:33 10048
-rwSrwSrw- 1 root root 0 Mar 4 15:33 10080
-rwSrwSrw- 1 root root 0 Mar 4 15:33 10112
-rwSrwSrw- 1 root root 0 Mar 4 15:33 10144
-rwSrwSrw- 1 root root 0 Mar 4 15:33 10176
-rwSrwSrw- 1 root root 0 Mar 4 15:33 10208

By "red" I mean files with the "S" bit set. They are of zero size, so I think they are useless.

I think the data on the "good" OST may also be damaged, so I have decided to delete all files on these two OSTs.

By the way, when I unlink a file, I get an "Input/output error", yet the file disappears:

# unlink run_0005818_Any_file007_SFO-1.rec
unlink: cannot unlink `run_0005818_Any_file007_SFO-1.rec': Input/output error
# ll run_0005818_Any_file007_SFO-1.rec
ls: run_0005818_Any_file007_SFO-1.rec: No such file or directory

I am not sure whether the file has been safely deleted. Any suggestions?

------------------
Lu Wang
2009-04-01
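In the listing above, -rwSrwSrw- means the setuid and setgid bits are set while the execute bits are not (a capital S in the owner and group positions); the "red" is presumably how the terminal's ls colouring highlights such files. The following Python sketch, with invented helper names is_empty_s_bit and scan_objects, separates those zero-length S-bit entries from the other objects in a directory of an OST mounted as ldiskfs. It only classifies files; it makes no claim about whether the flagged objects are safe to discard.

```python
import os
import stat


def is_empty_s_bit(size, mode):
    """True for a zero-length regular file with setuid or setgid set."""
    return (
        stat.S_ISREG(mode)
        and size == 0
        and bool(mode & (stat.S_ISUID | stat.S_ISGID))
    )


def scan_objects(directory):
    """Split regular files in `directory` into (flagged, other) name lists.

    `flagged` holds the zero-length S-bit files; `other` holds the rest.
    Subdirectories and special files are ignored.
    """
    flagged, other = [], []
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        st = entry.stat(follow_symlinks=False)
        if not stat.S_ISREG(st.st_mode):
            continue
        if is_empty_s_bit(st.st_size, st.st_mode):
            flagged.append(entry.name)
        else:
            other.append(entry.name)
    return flagged, other
```

For example, scan_objects("/lustre/ost1/O/0/d0") would return the names of the S-bit placeholders in the first list and every object that actually holds data in the second, which makes it easier to see how much of the OST still contains content.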
Brian J. Murrell
2009-Mar-31 17:30 UTC
[Lustre-discuss] File Content change without Error log
On Wed, 2009-04-01 at 01:24 +0800, Lu Wang wrote:
> I think the data on the "good" OST may also be damaged, so I have decided to delete all files on these two OSTs.

Probably the safest thing to do.

> By the way, when I unlink a file, I get an "Input/output error", yet the file disappears:
> # unlink run_0005818_Any_file007_SFO-1.rec
> unlink: cannot unlink `run_0005818_Any_file007_SFO-1.rec': Input/output error
> # ll run_0005818_Any_file007_SFO-1.rec
> ls: run_0005818_Any_file007_SFO-1.rec: No such file or directory
>
> I am not sure whether the file has been safely deleted. Any suggestions?

This is just speculation without having any evidence one way or another, but it is likely just due to the damaged OSTs.

You can use lfs find to find all files that had objects on those two OSTs and try to clean them up, or you can simply replace the OSTs with fresh ones and let lfsck sort it all out. It will correct the situations where files exist on the MDT but the objects on the OSTs are missing.

b.
Dear list,

When I try to remount the OST (after mkfs.lustre --reformat), I get these errors.

At the prompt:

mount -t lustre /dev/sda /lustre/ost1
mount.lustre: mount /dev/sda at /lustre/ost1 failed: Address already in use

In the log:

Apr 1 17:46:40 boss10 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.32 at tcp. The mgs_target_reg operation failed with -98
Apr 1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:1084:server_start_targets()) Required registration failed for besfs-OST0014: -98
Apr 1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:1623:server_fill_super()) Unable to start targets: -98
Apr 1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:1406:server_put_super()) no obd besfs-OST0014
Apr 1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:136:server_deregister_mount()) besfs-OST0014 not registered
Apr 1 17:46:40 boss10 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
Apr 1 17:46:40 boss10 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
Apr 1 17:46:40 boss10 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
Apr 1 17:46:40 boss10 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
Apr 1 17:46:40 boss10 kernel: Lustre: server umount besfs-OST0014 complete
Apr 1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:1980:lustre_fill_super()) Unable to mount (-98)

I reused the old index of the OST. The OSTs are still inactive in the system. Do I need to reactivate them first?

------------------
Lu Wang
2009-04-01
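The -98 that recurs in the log above is a negated Linux errno value: 98 is EADDRINUSE, the same "Address already in use" that mount.lustre prints, and the earlier "Input/output error" from unlink is EIO (5). The Python sketch below pulls such trailing codes out of log lines and names them. The small LINUX_ERRNO table and the function names are written for this illustration only, and the regular expression is tuned to the message shapes shown in this thread rather than to Lustre logs in general. Decoding the number does not by itself explain why the MGS refused the registration.

```python
import re

# Linux errno values that appear in this thread (asm-generic numbering).
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    98: ("EADDRINUSE", "Address already in use"),
}

# A negative number at the end of a line, optionally wrapped in parentheses,
# e.g. "failed with -98", "targets: -98" or "Unable to mount (-98)".
_TRAILING_CODE = re.compile(r"(?:^|[\s(])-(\d+)\)?\s*$")


def error_codes(log_text):
    """Return the sorted, de-duplicated errno values found at line ends."""
    codes = set()
    for line in log_text.splitlines():
        match = _TRAILING_CODE.search(line)
        if match:
            codes.add(int(match.group(1)))
    return sorted(codes)


def describe(code):
    """Return 'NAME (message)' for a known code, else a placeholder string."""
    name, text = LINUX_ERRNO.get(code, (None, None))
    if name is None:
        return "unknown errno %d" % code
    return "%s (%s)" % (name, text)
```

Feeding the log excerpt above to error_codes and passing each result to describe shows that every failing step reports the same EADDRINUSE condition, which suggests a single cause at registration time rather than several independent failures.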