thr3ads.net - Lustre discuss - [Lustre-discuss] File Content change without Error log [Mar 2009]

Lu Wang

2009-Mar-31 04:15 UTC

[Lustre-discuss] File Content change without Error log

Dear all,
     More than 100 files were damaged recently without any error logs on
the OSS. The damaged files have the same size as their original copies in our backup
system. However, the checksum changed. For example:
# ll run_0008126_All_file015_SFO-1.raw.353645
-rw-r--r--  1 chyd u07 2108082156 Mar 31 10:07 run_0008126_All_file015_SFO-1.raw.353645
# ll demaged
-rw-r--r--  1 root root 2108082156 Mar 31 11:19 demaged

# cmp run_0008126_All_file015_SFO-1.raw.353645 demaged
run_0008126_All_file015_SFO-1.raw.353645 demaged differ: byte 16777217, line 118663

# adler32 run_0008126_All_file015_SFO-1.raw.353645
adler32(run_0008126_All_file015_SFO-1.raw.353645) = 3653083401, 0xd9bda109
# adler32 demaged
adler32(demaged) = 195426776, 0xba5f9d8
PS:
1. The modification time of each damaged file is the same as the time it was copied to
Lustre.
2. There are no abnormal messages in the OSS logs.
3. All the OSSs and MDSs are working well.

Has anyone met this problem? Any ideas?

Best Regards
Lu Wang
--------------------------------------------------------------
Computing Center
IHEP    Email: Lu.Wang at ihep.ac.cn
--------------------------------------------------------------
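
For anyone wanting to repeat the comparison from the message above without cmp or an adler32 tool, here is a minimal Python sketch. It is not what was run in the thread: the function names `adler32_of` and `first_difference` and the 1 MiB chunk size are assumptions chosen for illustration. Note that cmp counts bytes from 1, so byte 16777217 is zero-based offset 16777216, which is exactly 16 * 1024 * 1024 (16 MiB).

```python
import zlib

CHUNK = 1 << 20  # read 1 MiB at a time so multi-gigabyte files are never held in memory


def adler32_of(path):
    """Return the Adler-32 checksum of a file, computed chunk by chunk."""
    value = 1  # Adler-32 starts from 1
    with open(path, "rb") as fh:
        while True:
            block = fh.read(CHUNK)
            if not block:
                break
            value = zlib.adler32(block, value)
    return value & 0xFFFFFFFF


def first_difference(path_a, path_b):
    """Return the 1-based number of the first differing byte, or None.

    If one file is a prefix of the other, the position just past the
    end of the shorter file is returned.
    """
    position = 0  # number of bytes already known to be identical
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return position + i + 1
                return position + min(len(a), len(b)) + 1
            if not a:
                return None
            position += len(a)
```

A first difference that falls on such an aligned offset may hint at damage in whole blocks rather than a stray bit flip, but that reading is an assumption and nothing in the listing proves it.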

Brian J. Murrell

2009-Mar-31 15:25 UTC

On Tue, 2009-03-31 at 12:15 +0800, Lu Wang wrote:
> Dear all,
>      More than 100 files were damaged recently without any error logs
> on the OSS. The damaged files have the same size as their original copies
> in our backup system. However, the checksum changed. For example:
> # ll run_0008126_All_file015_SFO-1.raw.353645
> -rw-r--r--  1 chyd u07 2108082156 Mar 31 10:07 run_0008126_All_file015_SFO-1.raw.353645
> # ll demaged
> -rw-r--r--  1 root root 2108082156 Mar 31 11:19 demaged

I'm assuming run_0008126_All_file015_SFO-1.raw.353645 is from your
backup and demaged is the "corrupt" file, is that correct?  I will base
my statements on that...

> # cmp run_0008126_All_file015_SFO-1.raw.353645 demaged
> run_0008126_All_file015_SFO-1.raw.353645 demaged differ: byte 16777217, line 118663
>
> # adler32 run_0008126_All_file015_SFO-1.raw.353645
> adler32(run_0008126_All_file015_SFO-1.raw.353645) = 3653083401, 0xd9bda109
> # adler32 demaged
> adler32(demaged) = 195426776, 0xba5f9d8
> PS:
> 1. The modification time of each damaged file is the same as the time it
> was copied to Lustre.

Why is the modification time of run_0008126_All_file015_SFO-1.raw.353645
and demaged different?  Could that difference, and the relative
newness of run_0008126_All_file015_SFO-1.raw.353645, explain what
happened (i.e. it was written to, legitimately)?

> 2. There are no abnormal messages in the OSS logs.

There wouldn't be in normal situations, such as the file being written to
after the backup was made.  The modification times give no assurance
that that was not the case, as "demaged" was written after
run_0008126_All_file015_SFO-1.raw.353645.

Also, silent disk corruption (i.e. in the hardware) could be a cause as
could any kind of silent failure below the Lustre stack.

Also, with regard to the backup file that you are comparing to, is it
truly the actual file on the backup medium that you are using in the
comparison or is it a copy (i.e. restored to a disk from the backup
medium)?

If it's a copy of the backup file, how do you know that the copy from
the backup is not actually corrupt and that the copy on disk is in fact
the true copy?  Or how do you know that the copy that's on the backup
medium is not corrupted (i.e. faulty backup medium)?  What's your point
of reference that assures that (the copy from) the backup is the true
image and not damaged?

Just some things to consider.

b.

Lu Wang

2009-Mar-31 16:41 UTC

Yes, you are right.
The problem was caused by a misconfiguration of one disk array: two partitions of
this array were mapped to the same LUN.
That is to say, when I created OST1 on /dev/sda and OST2 on /dev/sdb, the two OSTs
were actually written to the same disk partition on the disk array.

(It is quite strange that there were no errors when I created two OSTs on the
"same" partition.) When the content grew to nearly 40% of
each OST, some of the files were damaged (because of the double use of disk space).
I found this problem because all the damaged files are on the same OST. After
e2fsck, I lost one OST, and the other OST became double the size. There are a lot of
"red" files in directory "O" when I mount the OST as
ldiskfs.
I used
lfs getstripe --obd ****_UUID /dir
to generate the list of damaged files.
Is it possible to get back the lost OST using the "red" files? I am
not sure I have explained this clearly ...
------------------				 
Lu Wang
2009-04-01
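
The observation that every damaged file sat on the same OST is the step that exposed the misconfiguration, and it can be checked mechanically once the stripe layout of each damaged file is known. The sketch below assumes the layouts have already been collected into a dictionary mapping a file name to its list of OST indices; how that dictionary is built (for example from lfs getstripe output) is not shown, and the function names and sample data are hypothetical.

```python
from collections import Counter


def ost_hit_counts(stripes_by_file):
    """Count, per OST index, how many of the given files have an object on it."""
    counts = Counter()
    for ost_indices in stripes_by_file.values():
        counts.update(set(ost_indices))  # count each OST once per file
    return counts


def common_osts(stripes_by_file):
    """Return the sorted OST indices that appear in every file's layout."""
    stripe_sets = [set(v) for v in stripes_by_file.values()]
    if not stripe_sets:
        return []
    return sorted(set.intersection(*stripe_sets))


# Hypothetical sample: three damaged files and the OST indices they use.
damaged = {
    "file_a.raw": [3, 20],
    "file_b.raw": [20],
    "file_c.raw": [7, 20],
}
print(common_osts(damaged))                  # OSTs shared by all damaged files
print(ost_hit_counts(damaged).most_common(1))  # the most frequently hit OST
```

If `common_osts` returns a single index for a large set of damaged files, the storage behind that OST is a reasonable first suspect.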

Brian J. Murrell

2009-Mar-31 16:57 UTC

On Wed, 2009-04-01 at 00:40 +0800, Lu Wang wrote:
> Yes, you are right.
> The problem was caused by a misconfiguration of one disk array: two partitions
> of this array were mapped to the same LUN.

Hrm.  That sounds rather bad.

> That is to say, when I created OST1 on /dev/sda and OST2 on /dev/sdb, the two
> OSTs were actually written to the same disk partition on the disk array.

Ouch.

> After e2fsck, I lost one OST, and the other OST became double the size.

I don't know that I would trust such an OST even after an e2fsck.  The
structure of the filesystem may be repaired but the contents of files
are not.

Of course, I suppose recovering whatever you can from one of the two
OSTs is better than recovering from neither, but I would be very suspicious
of the data coming from it.

> There are a lot of "red" files in directory "O" when I
> mount the OST as ldiskfs.

I'm not sure what a "red" file is.

> I used
> lfs getstripe --obd ****_UUID /dir
> to generate the list of damaged files.
> Is it possible to get back the lost OST using the "red" files?

If I'm understanding what happened, I'd say you are rather lucky to get
any data from either of the OSTs and that recovering data from both OSTs
is rather unlikely.

b.

Lu Wang

2009-Mar-31 17:24 UTC

Yes, I am copying some files from our backup storage.
# pwd
/lustre/ost1/O/0/d0
[root@boss10 d0]# ll
total 58931924
-rwSrwSrw-  1 root  root          0 Mar  4 15:33 10016
-rwSrwSrw-  1 root  root          0 Mar  4 15:33 10048
-rwSrwSrw-  1 root  root          0 Mar  4 15:33 10080
-rwSrwSrw-  1 root  root          0 Mar  4 15:33 10112
-rwSrwSrw-  1 root  root          0 Mar  4 15:33 10144
-rwSrwSrw-  1 root  root          0 Mar  4 15:33 10176
-rwSrwSrw-  1 root  root          0 Mar  4 15:33 10208
By "red" I mean files with the "S" bit set. They are of zero size,
so I think they are useless.

I think data in the "good" OST may also be damaged, so I have decided to
delete all files on these two OSTs.

By the way, when I unlink a file, there is an "Input/output error";
however, the file disappears.
# unlink run_0005818_Any_file007_SFO-1.rec
unlink: cannot unlink `run_0005818_Any_file007_SFO-1.rec': Input/output error
# ll run_0005818_Any_file007_SFO-1.rec
ls: run_0005818_Any_file007_SFO-1.rec: No such file or directory

I am not sure whether the file was safely deleted or not. Any suggestions?


------------------				 
Lu Wang
2009-04-01
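
The "S" in a listing such as -rwSrwSrw- means the setuid and setgid bits are set while the execute bits are not, and many ls colour schemes highlight setuid files in red, which is presumably why they looked "red". A minimal Python sketch for separating such zero-size entries from the rest of a directory follows. The names `looks_like_empty_placeholder` and `split_objects` are made up for illustration, and what these objects mean inside Lustre is not established in the thread.

```python
import os
import stat


def looks_like_empty_placeholder(mode, size):
    """True for a zero-size regular file with the setuid or setgid bit set."""
    has_s_bit = bool(mode & (stat.S_ISUID | stat.S_ISGID))
    return stat.S_ISREG(mode) and size == 0 and has_s_bit


def split_objects(directory):
    """Split a directory's entries into (placeholders, others), sorted by name."""
    placeholders, others = [], []
    with os.scandir(directory) as entries:
        for entry in entries:
            info = entry.stat(follow_symlinks=False)
            if looks_like_empty_placeholder(info.st_mode, info.st_size):
                placeholders.append(entry.name)
            else:
                others.append(entry.name)
    return sorted(placeholders), sorted(others)


# Mode 0o106666 is a regular file with setuid, setgid and rw for everyone;
# stat.filemode renders it the same way ls does.
print(stat.filemode(0o106666))
```

Pointing `split_objects` at a directory such as the one listed above would be read-only; it only calls stat on each entry and changes nothing.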

Brian J. Murrell

2009-Mar-31 17:30 UTC

On Wed, 2009-04-01 at 01:24 +0800, Lu Wang wrote:
> I think data in the "good" OST may also be damaged, so I have decided
> to delete all files on these two OSTs.

Probably the safest thing to do.

> By the way, when I unlink a file, there is an "Input/output error";
> however, the file disappears.
> # unlink run_0005818_Any_file007_SFO-1.rec
> unlink: cannot unlink `run_0005818_Any_file007_SFO-1.rec': Input/output error
> # ll run_0005818_Any_file007_SFO-1.rec
> ls: run_0005818_Any_file007_SFO-1.rec: No such file or directory
>
> I am not sure whether the file was safely deleted or not. Any suggestions?

This is just speculation without having any evidence one way or another,
but it is likely just due to the damaged OSTs.

You can use lfs find to find all files that had objects on those two
OSTs and try to clean them up, or you can simply replace the OSTs with
fresh ones and let lfsck sort it all out.  It will correct the
situations where files exist on the MDT but the objects on the OSTs are
missing.

b.

Lu Wang

2009-Apr-01 09:56 UTC

Dear list,
    When I try to remount the OST (after mkfs.lustre --reformat), I get these
errors:

At the command prompt:
# mount -t lustre /dev/sda /lustre/ost1
mount.lustre: mount /dev/sda at /lustre/ost1 failed: Address already in use

In the log:
Apr  1 17:46:40 boss10 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.32@tcp. The mgs_target_reg operation failed with -98
Apr  1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:1084:server_start_targets()) Required registration failed for besfs-OST0014: -98
Apr  1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:1623:server_fill_super()) Unable to start targets: -98
Apr  1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:1406:server_put_super()) no obd besfs-OST0014
Apr  1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:136:server_deregister_mount()) besfs-OST0014 not registered
Apr  1 17:46:40 boss10 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
Apr  1 17:46:40 boss10 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
Apr  1 17:46:40 boss10 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
Apr  1 17:46:40 boss10 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
Apr  1 17:46:40 boss10 kernel: Lustre: server umount besfs-OST0014 complete
Apr  1 17:46:40 boss10 kernel: LustreError: 2204:0:(obd_mount.c:1980:lustre_fill_super()) Unable to mount  (-98)

I used the old index of the OST. The OSTs are still inactive in the system. Do I
need to reactivate them first?
------------------				 
Lu Wang
2009-04-01
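
The -98 in the kernel messages above is a negated errno value. On Linux, errno 98 is EADDRINUSE, which matches the "Address already in use" text printed by mount.lustre. The Python sketch below pulls the trailing code out of a log line and names it. The helper names and the regular expression are assumptions written for the three message shapes seen in this log, and errno numbering differs on other operating systems.

```python
import errno
import os
import re

# Matches a trailing negative code as in "... failed with -98",
# "... Unable to start targets: -98" or "... Unable to mount  (-98)".
_CODE = re.compile(r"(?:with |: |\()-(\d+)\)?\s*$")


def errno_from_log_line(line):
    """Return the positive errno found at the end of a log line, or None."""
    match = _CODE.search(line)
    return int(match.group(1)) if match else None


def describe(code):
    """Return 'NAME: message' for an errno value on the current platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{name}: {os.strerror(code)}"


line = "The mgs_target_reg operation failed with -98"
code = errno_from_log_line(line)
if code is not None:
    print(describe(code))
```

Decoding the number only restates the symptom: the registration with the MGS was refused because something (presumably the old target entry at that index) is still considered in use. The thread does not say how the situation was resolved.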
