Hello again,
I'm sorry to insist on this matter, but I'm wondering whether this message
went unnoticed or whether I said something absurd (?). I find it strange
that nobody had anything to say.
> * I now know there are corrupted files, and I don't even know how
> many there are. I just know there are at least as many as 'file'
> detected. Some of the files whose inodes fsck.ocfs2 tried to clone
> belong to the time period above, and this suggests there was some kind
> of mess going on in which the cluster wrote parts of different files to
> the same blocks. What could have caused this, and how do I keep it from
> happening again?
>
> * Should I turn on tracing for a particular bit? Which one?
>
> * How can I monitor OCFS2 health on a running cluster?
I also find the output of mounted.ocfs2 strange:
[root@server01 ~]# mounted.ocfs2 -f
Device FS Nodes
/dev/sdc1 ocfs2 Unknown, server01
The output on server02 is consistent with this.
I'd really like to hear your thoughts on this.
--
Nuno Tavares
DRI, Consultoria Informática
Telef: +351 936 184 086
Nuno Tavares wrote:
> Greetings all,
>
> I'm wondering if anyone can shed some light here.
>
> Some days ago a user reported problems dealing with a specific
> directory. After further investigation, I now suspect that data was
> corrupted during a specific time period.
>
> Addressing the initial issue, I've checked /var/log/messages just in
> case and it had lots of messages like this:
>
> Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_check_dir_entry:111 ERROR:
> bad entry in directory #8075816: directory entry across blocks -
> offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
> Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_prepare_dir_for_insert:1734
> ERROR: status = -2
> Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_mknod:240 ERROR: status = -2
> Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_check_dir_entry:111 ERROR:
> bad entry in directory #8075816: directory entry across blocks -
> offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
> Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_prepare_dir_for_insert:1734
> ERROR: status = -2
> Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_mknod:240 ERROR: status = -2
> [...]
>
> Indeed, that inode was bound to the problematic directory:
>
> [root@fsnode01 ~]# debugfs.ocfs2 -R "findpath <8075816>" /dev/sdc1
> 8075816 /storage/problematic/directory/
>
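A note on the kernel messages quoted above: they can be tallied per directory to see whether #8075816 is the only directory affected. Below is a minimal Python sketch. It assumes each message sits on a single line in /var/log/messages (the wrapping above comes from the mail client) and that the "bad entry in directory #N" wording is the same for every occurrence. The name count_bad_entries is made up for this example and is not part of any OCFS2 tool.

```python
import re
from collections import Counter

# Pattern inferred only from the sample messages quoted above.
BAD_ENTRY = re.compile(r"bad entry in directory #(\d+)")


def count_bad_entries(lines):
    """Return a Counter mapping directory inode number -> number of errors."""
    counts = Counter()
    for line in lines:
        match = BAD_ENTRY.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

It could be fed the open log file, e.g. count_bad_entries(open("/var/log/messages", errors="replace")), and each inode number it returns could then be resolved with debugfs.ocfs2 "findpath" as shown above.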
> So I brought the cluster down and requested a filesystem check which
> dumped a lot of messages like this:
>
> Cluster 1135086 is claimed by the following inodes:
> /storage/unrelated/file1
> /storage/unrelated/file2
> [DUP_CLUSTERS_CLONE] Inode "/storage/unrelated/file1" may be cloned or
> deleted to break the claim it has on its clusters. Clone inode
> "/storage/unrelated/file1" to break claims on clusters it shares with
> other inodes? y
> pass1d: Invalid argument passed to OCFS2 library while reading inode to
> clone
>
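Since it is unclear how many files are involved, the fsck.ocfs2 output above could be reduced to a list of every file that shares a cluster with another one. The following Python sketch assumes the layout seen in the excerpt: a "Cluster N is claimed by the following inodes:" line followed by one path per line, with any other line ending the list. Inodes that fsck cannot resolve to a path may be printed differently and would be missed. Both function names are hypothetical.

```python
import re
from collections import defaultdict

# Header format taken from the fsck.ocfs2 excerpt quoted above.
CLAIM_HEADER = re.compile(r"^Cluster (\d+) is claimed by the following inodes:")


def parse_duplicate_claims(lines):
    """Map cluster number -> list of paths reported as claiming that cluster."""
    claims = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        header = CLAIM_HEADER.match(line)
        if header:
            current = int(header.group(1))
        elif current is not None and line.startswith("/"):
            claims[current].append(line)
        else:
            current = None  # any other line ends the list of claimants
    return dict(claims)


def files_sharing_clusters(claims):
    """Return the sorted, de-duplicated paths that share at least one cluster."""
    return sorted({path for paths in claims.values() for path in paths})
```

Comparing that list against the files 'file' flags as damaged would show how much the two sets overlap.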
> Note that last (pass1d) message. I've checked my tools and, although
> their versions don't match, they are the latest available:
> [root@fsnode01 ~]# rpm -qa | grep ocfs
> ocfs2-tools-1.4.3-1.el5
> ocfs2-2.6.18-164.el5-1.4.4-1.el5
> ocfs2console-1.4.3-1.el5
>
> Notice the kernel modules are 1.4.4 and the tools are 1.4.3. Could this
> version mismatch cause the pass1d error? Does it have any other
> consequence? I've checked again, and these were the only ones available...
>
> I must say /storage/unrelated/* are all PDF files. However, some of
> them are damaged, and using 'file -bi' I've tracked the damaged ones
> down to an interval between 'Jan 18 09:47' and 'Jan 18 12:24'. I could
> only track these files because 'file' reported a damaged PDF header; I
> can't be sure the other ones are all OK, I can only say their header is
> OK.
>
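The header check described above can be repeated over a whole tree and a time window. This Python sketch is a rough stand-in for the 'file -bi' test, not a reimplementation of it: it only checks that a file begins with the bytes "%PDF-" and uses the modification time to select the window. The function names are invented for illustration. As noted above, a good header says nothing about the rest of the file.

```python
import os
from datetime import datetime

PDF_MAGIC = b"%PDF-"


def has_pdf_header(path):
    """True if the file starts with the PDF magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC


def damaged_pdfs(root, start, end):
    """Yield (path, mtime) for files under root that were modified within
    [start, end] and do not start with a PDF header."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = datetime.fromtimestamp(os.path.getmtime(path))
            if start <= mtime <= end and not has_pdf_header(path):
                yield path, mtime


# Example call; the year 2010 is an assumption, the messages only say "Jan 18":
# for path, mtime in damaged_pdfs("/storage/unrelated",
#                                 datetime(2010, 1, 18, 9, 47),
#                                 datetime(2010, 1, 18, 12, 24)):
#     print(mtime, path)
```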
> Also worth mentioning is that there are other files within that time
> interval that seem to be OK (again, I can't be sure). I can't be certain
> when this mess started or when the cluster recovered from it.
>
> I'm almost sure the files were OK when they were "about to be" stored on
> /storage. This investigation suggests they were damaged *during* their
> existence on /storage. I've now taken appropriate measures to be able to
> prove this in the future.
>
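The message does not say which measures were taken. One possibility, given here purely as an assumption, is to record a checksum for each file just before it is stored and compare it later. A minimal Python sketch with hypothetical function names:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record(paths):
    """Build a manifest mapping each path to its current digest."""
    return {path: sha256_of(path) for path in paths}


def changed_since(manifest):
    """Return the sorted paths whose content no longer matches the manifest.
    A file that has disappeared raises FileNotFoundError."""
    return sorted(p for p, digest in manifest.items() if sha256_of(p) != digest)
```

The manifest would have to be kept somewhere other than the OCFS2 volume, so that it cannot be damaged along with the files it describes.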
> What is puzzling me is:
> * I now know there are corrupted files, and I don't even know how
> many there are. I just know there are at least as many as 'file'
> detected. Some of the files whose inodes fsck.ocfs2 tried to clone
> belong to the time period above, and this suggests there was some kind
> of mess going on in which the cluster wrote parts of different files to
> the same blocks. What could have caused this, and how do I keep it from
> happening again?
>
> * Should I turn on tracing for a particular bit? Which one?
>
> * How can I monitor OCFS2 health on a running cluster?
>
> Regards,
>
>