Chen, Yukun
2004-Aug-20 03:25 UTC
[Ocfs2-devel] [2.6.6 svn 1364]System hang randomly when writing to the same file from different processes of the same node
Hi all,

Steps to duplicate:

1. Do some operations, such as mkdir and touch, on node A and node B.
2. On node A, process1 writes to a file at a specific position (such as offset 1000), 100 times. At the same time, also on node A, process2 writes to the same file at the same position, 100 times.

Repeat steps 1-2 several times. The system will hang, with the following messages on node A:

state=1, lockid=22765568, flags = 0x1000, asked type = 5 master = 1, state = 0x0, type = 5
(18397) ERROR at /tmp/trunk/src/dlm.c, 461: status = -110
(18397) ERROR at /tmp/trunk/src/vote.c, 910: inode 5558, vote_status=0, vote_state=1, lockid=22765568, flags = 0x1000, asked type = 5 master = 1, state = 0x0, type = 5
...

On node B, dmesg shows:

Call Trace:
recalc_task_prio
shedule
ocfs_comm_process_msg
ocfs_dlm_recv_msg
worker_thread
ocfs_dlm_recv_msg
default_wake_function
....

Any ideas on it? Thanx.

Aaron
Intel China Software Lab
Tel: 8621-52574545 Ext.1587
E_mail: yukun.chen@intel.com
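A note on the log in this report: on Linux, errno 110 is ETIMEDOUT, so "status = -110" is consistent with a request that timed out. This assumes dlm.c reports a negated errno value, which the thread does not confirm. The helper below is a small sketch for decoding such a status; describe_status is a hypothetical name, not part of OCFS2.

```python
import errno
import os


def describe_status(status):
    """Return (name, message) for a negated errno-style status code.

    Assumption: the status is a negated errno value, as is common in
    Linux kernel code. The thread does not confirm this for dlm.c.
    """
    code = -status
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # -110 is the value shown in the dlm.c error line of the report.
    print(describe_status(-110))
```

The numeric values are platform-specific, so the lookup is only meaningful when run on the same kind of system that produced the log.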
Mark Fasheh
2004-Aug-20 12:43 UTC
[Ocfs2-devel] [2.6.6 svn 1364]System hang randomly when writing to the same file from different processes of the same node
Are all your nodes updated to r1364, btw? That'd make a big difference, as the voting flags got juggled around a bit (sorry!). Otherwise it looks like it's hung doing a TRUNCATE_PAGES message, which would be very troubling indeed. If both nodes *are* in fact running 1364, would you mind posting your test code so I can give it a try?

Thanks,
--Mark

--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh@oracle.com
Chen, Yukun
2004-Aug-22 21:51 UTC
[Ocfs2-devel] [2.6.6 svn 1364]System hang randomly when writing to the same file from different processes of the same node
Hi Mark,

I checked the version and found 1364 on both nodes. I have also attached the test cases for duplicating this bug.

Steps to run the test case:

1. Make sure you have set up the tvs environment.
2. Make sure the two test machines can ssh to each other as root without a password.
3. Update the variable OCFSDEV in test.config to the device name of your ocfs2 partition.
4. Update the variable REMOTE in setup.sh to the remote machine name.
5. Make sure you have created the directory /ocfs (in a later version I will change this to an arbitrary directory that the user can set).
6. Run "test_filelock.sh".

Feel free to let me know if you have any problems. Thanx.

Aaron

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hang.tar
Type: application/x-tar
Size: 20480 bytes
Desc: hang.tar
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20040823/68866931/hang-0001.tar
Chen, Yukun
2004-Aug-24 20:01 UTC
[Ocfs2-devel] [2.6.6 svn 1364]System hang randomly when writing to the same file from different processes of the same node
The test_filelock.sh script has two steps: "inode-test.sh" and "filelock-single.sh".

In "inode-test.sh" we load the ocfs2 module on *both* nodes and do some file/dir access across the two nodes.

In "filelock-single.sh", on *one node only*, we write to one place in a file from 2 processes simultaneously.

I think the bug will be duplicated if you do some operations across the 2 nodes before writing the file.

Hope it will help. Thanx.

Aaron

-----Original Message-----
From: Mark Fasheh [mailto:mark.fasheh@oracle.com]
Sent: 2004-08-25 2:42
To: Chen, Yukun
Cc: ocfs2-devel@oss.oracle.com
Subject: Re: [Ocfs2-devel] [2.6.6 svn 1364]System hang randomly when writing to the same file from different processes of the same node

On Mon, Aug 23, 2004 at 10:51:30AM +0800, Chen, Yukun wrote:
> Hi Mark
>
> I checked the version and found 1364 on both nodes.

Ok. What messages do you see on "node B" when this happens on A? Is node B doing anything in particular?

> Also, I attached the test cases for duplicating such bug.

I might've bitten off more than I can chew by asking for that test code :) I wrote a simple program to write in one place on a file (and I run this twice) and I couldn't reproduce it yet. Looking through your test scripts it seems that's basically what's going on, but please fill me in on any steps I've missed.

I guess I'm looking for an easily reproducible test case. Does this happen every time you run your test suite or is it intermittent?

--Mark

--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh@oracle.com
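
The single-node write phase discussed in this thread (two processes writing to one place in a file at the same time) can be sketched as below. This is neither the attached hang.tar test suite nor Mark's program, since the thread shows neither. The offset 1000 and the count of 100 come from the original report; the function names, payloads, and default file path are hypothetical. The cross-node operations (mkdir, touch on both nodes) that Aaron says must come first are not included. The sketch uses os.fork, so it assumes a POSIX system.

```python
import os
import tempfile

OFFSET = 1000      # file offset named in the original report
ITERATIONS = 100   # number of writes per process named in the report


def write_repeatedly(path, payload, offset=OFFSET, iterations=ITERATIONS):
    """Overwrite the same byte range of `path` `iterations` times."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for _ in range(iterations):
            os.pwrite(fd, payload, offset)
    finally:
        os.close(fd)


def run_concurrent_writers(path):
    """Fork two writers on this node; return their raw wait statuses."""
    pids = []
    for payload in (b"A" * 8, b"B" * 8):
        pid = os.fork()
        if pid == 0:
            # Child process: do the writes, then exit without returning.
            status = 1
            try:
                write_repeatedly(path, payload)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    # A status of 0 means the child exited cleanly.
    return [os.waitpid(pid, 0)[1] for pid in pids]


if __name__ == "__main__":
    # Hypothetical default; point this at a file on the OCFS2 mount.
    target = os.path.join(tempfile.gettempdir(), "same_offset_test.dat")
    print(run_concurrent_writers(target))
```

On a local file system this only demonstrates the access pattern. Per Aaron's description, reproducing the hang would also require running the cross-node phase beforehand and pointing the path at a file on the shared OCFS2 volume.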