Jacques-Charles Lafoucriere
2003-Dec-17 12:46 UTC
[lustre-devel] Tests with lustre 1.0.0/1.0.1
Hello,

I am using the following cluster:
- 100 UP P4 clients, 1 GB mem
- 6 BP P4 OST nodes, 2 GB mem (each OST node has 3 partitions on 3 SCSI disks => 18 OSTs)
- 1 BP MDS node, 2 GB mem
- gigabit ethernet as interconnect

The base kernel is 2.4.20-19.9.

I have installed Lustre 1.0.0 and Lustre 1.0.1 (with portals debug off) with kernel patch levels 29 and 32 (1.0.0 tested with 29 and 32, 1.0.1 only with 32).

I am running the following test: each client writes a 900 MB private file in a common directory.

With both releases I get I/O errors on some files (2 or 3, not always the same ones).

I see the following messages on the OSTs:

Dec 17 13:13:42 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 13:13:46 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 13:13:52 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 13:13:56 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 13:14:02 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 13:14:06 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 13:14:12 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 13:14:16 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000

A few days ago, I ran the same test with a b_devel snapshot from November and patch level 29 (the latest with that release) and had no issues (it was very stable).

Any ideas?

JCL
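For reference, the per-client workload in this test amounts to something like the C sketch below. The mount point /mnt/lustre/test, the hostname-based file naming, and the 1 MB chunk size are illustrative assumptions, not the actual test code:

/* Minimal sketch of the test described above: each client writes a
 * 900 MB private file into a shared Lustre directory.  Path and
 * chunk size are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20)              /* write in 1 MB chunks */
#define TOTAL (900L * (1 << 20))     /* 900 MB per client */

int main(void)
{
    char path[256], host[64];
    static char buf[CHUNK];
    long written = 0;
    int fd;

    gethostname(host, sizeof(host));
    snprintf(path, sizeof(path), "/mnt/lustre/test/%s.dat", host);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(buf, 0xAA, sizeof(buf));
    while (written < TOTAL) {
        ssize_t rc = write(fd, buf, CHUNK);
        if (rc < 0) {               /* this is where the reported
                                     * I/O errors would surface */
            perror("write");
            close(fd);
            return 1;
        }
        written += rc;
    }
    close(fd);
    return 0;
}

One instance per client, all pointed at the same directory, reproduces the N-clients/N-files pattern discussed later in the thread.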
I have gotten some errors recently that were also in niobuf.c (see
https://bugzilla.lustre.org/show_bug.cgi?id=2363). Both of our errors are
triggered by PTL_MD_INUSE being the return value of PtlMDUnlink and
l_wait_event returning -ETIMEDOUT. I'm not sure how our bugs are related
(they may not be), but I thought I would point it out. When I get these
errors, occasionally the system becomes unusable; are you able to
eventually continue with reads and writes?

-Blake

Whatchu talkin' 'bout, Willis?

--
Blake Matheny                                      bmatheny@purdue.edu
The things that should be available to everyone,   http://www.mkfifo.net
are the things that impact everyone.               http://ovmj.org/GNUnet/
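The code path Blake points at has roughly this shape: ptlrpc_abort_bulk() tries to unlink a bulk memory descriptor and, if the network still holds it, waits with a timeout for it to be released. The self-contained pthread analogy below mimics only that shape; it is emphatically not the Lustre/Portals code, and md_unlink()/wait_for_idle() are stand-ins invented for illustration:

/* Self-contained analogy of the failure shape in ptlrpc_abort_bulk():
 * an unlink that fails while the descriptor is "in use", then a timed
 * wait that expires with -ETIMEDOUT.  NOT the Lustre/Portals code. */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  idle = PTHREAD_COND_INITIALIZER;
static int in_use = 1;                  /* stands in for PTL_MD_INUSE */

/* stand-in for PtlMDUnlink(): refuses while the "bulk" is in flight */
static int md_unlink(void)
{
    int busy;

    pthread_mutex_lock(&lock);
    busy = in_use;
    pthread_mutex_unlock(&lock);
    return busy ? -EBUSY : 0;
}

/* stand-in for l_wait_event() with a timeout */
static int wait_for_idle(int seconds)
{
    struct timespec deadline;
    int rc = 0;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += seconds;

    pthread_mutex_lock(&lock);
    while (in_use && rc == 0)
        rc = pthread_cond_timedwait(&idle, &lock, &deadline);
    pthread_mutex_unlock(&lock);
    return rc == ETIMEDOUT ? -ETIMEDOUT : 0;
}

int main(void)
{
    if (md_unlink() != 0 && wait_for_idle(10) == -ETIMEDOUT)
        /* nothing ever completes the transfer here, so the wait times
         * out -- the analogue of the repeating console messages above */
        fprintf(stderr, "Unexpectedly long timeout\n");
    return 0;
}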
Jacques-Charles Lafoucriere
2003-Dec-18 12:53 UTC
[lustre-devel] Tests with lustre 1.0.0/1.0.1
The client on which the error occurs has to be rebooted (sometimes rlogin
is no longer possible on the node). The other nodes are OK and can
continue to work with Lustre (ls -l works, read/write also).

I have also run the tests with 10 nodes (instead of 100, but still with
100 clients configured) and I see the same issue. Maybe I have some kind
of saturation of my ethernet network. With the previous release the
per-client performance was lower; with 1.0.x the stress on the OST nodes
may be higher. I'll do some tuning of the TCP memory buffers on the nodes
to make sure there are many more buffers on the OSTs than on the clients.

JC
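The tuning JC mentions is typically done by raising the kernel's TCP buffer limits through /proc. A minimal sketch follows, with purely illustrative values rather than recommendations (on the OSTs one would presumably set larger limits than on the clients):

/* Sketch of raising the TCP buffer limits via /proc, per JC's plan.
 * Values are illustrative; the paths are the standard Linux 2.4
 * TCP tunables.  Requires root. */
#include <stdio.h>

static int set_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void)
{
    /* per-socket TCP buffers: min / default / max, in bytes */
    set_sysctl("/proc/sys/net/ipv4/tcp_rmem", "4096 262144 1048576");
    set_sysctl("/proc/sys/net/ipv4/tcp_wmem", "4096 262144 1048576");
    /* global caps honoured by setsockopt(SO_RCVBUF / SO_SNDBUF) */
    set_sysctl("/proc/sys/net/core/rmem_max", "1048576");
    set_sysctl("/proc/sys/net/core/wmem_max", "1048576");
    return 0;
}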
Jacques-Charles Lafoucriere wrote:
> The client on which the error occurs has to be rebooted (sometimes
> rlogin is no longer possible on the node). The other nodes are OK and
> can continue to work with Lustre (ls -l works, read/write also).

Same thing here.

> I have also run the tests with 10 nodes (instead of 100, but still
> with 100 clients configured) and I see the same issue. Maybe I have
> some kind of saturation of my ethernet network. With the previous
> release the per-client performance was lower; with 1.0.x the stress on
> the OST nodes may be higher. I'll do some tuning of the TCP memory
> buffers on the nodes to make sure there are many more buffers on the
> OSTs than on the clients.

I doubt it's saturation. I have a test cluster with only 6 machines
(1 MDS, 4 OSTs, 1 client, LOV across all OSTs) on a 100Mb switch, and I
really doubt it's being saturated. Although, I do seem to get this error
mostly when I'm testing writes of large files.

-Blake

--
Blake Matheny                                      bmatheny@purdue.edu
The things that should be available to everyone,   http://www.mkfifo.net
are the things that impact everyone.               http://ovmj.org/GNUnet/
Hello Jacques--

Jacques-Charles Lafoucriere wrote:
> I have also run the tests with 10 nodes (instead of 100, but still
> with 100 clients configured) and I see the same issue. Maybe I have
> some kind of saturation of my ethernet network. With the previous
> release the per-client performance was lower; with 1.0.x the stress on
> the OST nodes may be higher. I'll do some tuning of the TCP memory
> buffers on the nodes to make sure there are many more buffers on the
> OSTs than on the clients.

I agree with Blake that the network is probably not saturated as such,
but something is causing a node not to respond in a timely fashion,
which leads to timeouts and errors. Certainly the test that you're
running is something which our production installations do every day.

TCP memory buffers will not normally cause something catastrophic, in my
experience. A shortage of Lustre request buffers would, but with 10
clients, I think it is very unlikely that you are running out of Lustre
request buffers.

>>> I have the following messages on the OSTs:
>>> Dec 17 13:13:42 cors123 kernel: LustreError:
>>> 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout:
>>> desc f737d800

Can you send me the messages that you see on the OSTs and the client
before this one occurs? By the time this message arrives, things have
already been unhappy for a little while.

Also, you mentioned that you can reproduce this with just 10 clients.
Can you perhaps reproduce it with just one? What is your default file
striping pattern? If you run "vmstat" on one or two OSTs while this
test is running, what is the average disk I/O (bo) rate?

It may be that your configuration requires some tuning away from the
default cache parameters, although with only 10 clients something else
would probably also have to be wrong.

>>> A few days ago, I ran the same test with a b_devel snapshot from
>>> November and patch level 29 (the latest with that release) and had
>>> no issues (it was very stable)

That is very interesting, too. Do you happen to remember which snapshot
number this was?

Thanks for your patience and feedback--

-Phil
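A small helper in the spirit of the vmstat suggestion: sample vmstat for about a minute and average the disk blocks-out ("bo") column. The 5-second interval and sample count are arbitrary choices; the column index is taken from vmstat's own header line rather than assumed:

/* Average the "bo" column of vmstat output.  The column position is
 * located from vmstat's header line, so no fixed layout is assumed. */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *p = popen("vmstat 5 13", "r");   /* 13 reports; the first is
                                            * the since-boot average */
    char line[512], copy[512], *tok;
    int i, col = -1;
    long sum = 0, samples = 0;

    if (!p) {
        perror("popen");
        return 1;
    }
    while (fgets(line, sizeof(line), p)) {
        strcpy(copy, line);
        if (col < 0) {
            /* header pass: find which field is named "bo" */
            for (i = 0, tok = strtok(copy, " \t\n"); tok;
                 tok = strtok(NULL, " \t\n"), i++)
                if (strcmp(tok, "bo") == 0)
                    col = i;
            continue;
        }
        /* data pass: walk to field 'col', accumulate if numeric */
        for (i = 0, tok = strtok(copy, " \t\n"); tok && i < col;
             tok = strtok(NULL, " \t\n"), i++)
            ;
        if (tok && isdigit((unsigned char)*tok)) {
            sum += atol(tok);
            samples++;
        }
    }
    pclose(p);
    if (samples > 0)
        printf("average bo: %ld blocks/s over %ld samples\n",
               sum / samples, samples);
    return 0;
}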
Jacques-Charles Lafoucriere
2003-Dec-18 16:46 UTC
[lustre-devel] Tests with lustre 1.0.0/1.0.1
Phil Schwan wrote:
> Can you send me the messages that you see on the OSTs and the client
> before this one occurs? By the time this message arrives, things have
> already been unhappy for a little while.

On the OST, there are plenty of lines of the same type and a few lines
like:

Dec 17 13:13:33 cors123 kernel: Lustre: 1732:(socknal_cb.c:1534:ksocknal_process_receive()) [d8856800] EOF from 0xc0a8011b ip c0a8011b:32805

and also:

Dec 17 12:56:42 cors123 kernel: LustreError: 1999:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:42 cors123 kernel: LustreError: 1968:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:42 cors123 kernel: LustreError: 1983:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:42 cors123 kernel: LustreError: 1978:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:42 cors123 kernel: LustreError: 1971:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:46 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 12:56:52 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 12:56:56 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 12:57:02 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 12:57:06 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 12:57:08 cors123 kernel: Lustre: 1732:(socknal_cb.c:1534:ksocknal_process_receive()) [d88b3800] EOF from 0xc0a80161 ip c0a80161:32788
Dec 17 12:57:08 cors123 kernel: LustreError: 1740:(socknal.c:1004:ksocknal_destroy_conn()) Refusing to complete a partial receive from 0xc0a80161, ip c0a80161
Dec 17 12:57:08 cors123 kernel: LustreError: 1740:(socknal.c:1006:ksocknal_destroy_conn()) This may hang communications and prevent modules from unloading
Dec 17 12:57:12 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800

I can send you the full listings, but they are big.

On the client (I am not sure it is the right one, but it is the only one
which has errors at 13:13):

Dec 17 13:13:17 cors071 kernel: LustreError: 1717:(socknal_cb.c:2491:ksocknal_find_timed_out_conn()) Timed out TX to 0xc0a8017e 4425024 f75e3000
Dec 17 13:41:37 cors071 kernel: LustreError: 1717:(socknal_cb.c:2522:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8017e ip c0a8017e:988
Dec 17 13:41:37 cors071 kernel: LustreError: 1742:(client.c:810:ptlrpc_expire_one_request()) @@@ timeout req@ed252000 x6558/t0 o4->OST_cors126_2_UUID@NID_192.168.1.126_UUID:6 lens 288/248 ref 1 fl RPC:N/0/0 rc 0
Dec 17 13:41:37 cors071 kernel: LustreError: 1742:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT OST_cors126_2_UUID OSC_cors071.ocre.cea.fr_OST_cors126_2_MNT_cors071 NID_192.168.1.126_UUID: -2; check /proc/sys/lustre/lustre_upcall

> Also, you mentioned that you can reproduce this with just 10 clients.
> Can you perhaps reproduce it with just one? What is your default file
> striping pattern?

With one client it works well (and faster than with the previous
release). The default striping is over all OSTs => 18.

> That is very interesting, too. Do you happen to remember which
> snapshot number this was?

I downloaded it on November 8th at 23:20 (French time). I used it with
up to 100 clients writing to 6 and 12 OSTs with many different stripe
sizes.
Hello Jacques--

Jacques-Charles Lafoucriere wrote:
> On the client (I am not sure it is the right one, but it is the only
> one which has errors at 13:13):
> Dec 17 13:13:17 cors071 kernel: LustreError:
> 1717:(socknal_cb.c:2491:ksocknal_find_timed_out_conn()) Timed out TX
> to 0xc0a8017e 4425024 f75e3000

I have one idea before we go to the next step: are you using the RPMs
from the web site, or did you compile Lustre yourself? If you compiled
your own Lustre, did you run configure with --enable-zerocopy?

Recent testing is showing us that the memory allocations in the
non-zero-copy TCP path may be doing significant harm, because they can
block unpredictably. Lustre 1.0.2 will default to zero-copy, but 1.0.1
did not.

If you are already using zero-copy, then we will give you some
instructions for gathering more detailed log data. This message points
us in the right direction, I think.

>> What is your default file striping pattern?
>
> The default striping is over all OSTs => 18.

Although this is not the source of your problem, with these kinds of
applications (N clients writing N files), we have found that one stripe
per file is usually best. 100 clients are many more than are needed to
saturate 18 OSTs, and more stripes just cause unnecessary contention.
For the other common case (N clients writing 1 file), it is indeed best
to stripe over all OSTs for maximum performance.

Unfortunately, it is not yet possible to change your default stripe
policy on a per-directory or per-environment basis, so you would have to
create the files in advance with a script, or change your application to
call an ioctl. Neither is a great answer.

Best--

-Phil
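As a rough illustration of the ioctl route Phil mentions: the sketch below creates a file with an explicit one-stripe layout. The names used (lustre/lustre_user.h, struct lov_user_md, O_LOV_DELAY_CREATE, LL_IOC_LOV_SETSTRIPE) follow later Lustre releases and are assumed here; the exact 1.0-era API may differ:

/* Sketch: create a file with an explicit stripe layout via ioctl.
 * Header, struct, and flag names follow later Lustre releases; the
 * 1.0-era equivalents may differ, so treat this as illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <lustre/lustre_user.h>

int create_one_stripe(const char *path)
{
    struct lov_user_md lum;
    int fd;

    memset(&lum, 0, sizeof(lum));
    lum.lmm_magic         = LOV_USER_MAGIC;
    lum.lmm_stripe_size   = 1 << 20;  /* 1 MB stripe size (arbitrary) */
    lum.lmm_stripe_count  = 1;        /* one stripe per file, per Phil */
    lum.lmm_stripe_offset = -1;       /* let the MDS pick the OST */

    /* the layout must be set before any objects are allocated, hence
     * O_LOV_DELAY_CREATE on a brand-new file */
    fd = open(path, O_CREAT | O_EXCL | O_WRONLY | O_LOV_DELAY_CREATE, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum) < 0) {
        perror("LL_IOC_LOV_SETSTRIPE");
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;                        /* caller writes, then closes */
}

int main(int argc, char **argv)
{
    int fd = create_one_stripe(argc > 1 ? argv[1]
                                        : "/mnt/lustre/test/striped.dat");
    if (fd >= 0)
        close(fd);
    return fd < 0;
}

Creating the files in advance with a wrapper like this, or with the stripe-creation utility shipped with Lustre, gets the one-stripe-per-file layout without touching the application itself.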