On Aug 26, 2005 11:40 +0200, Roland Fehrenbacher wrote:
> I permanently (every couple of minutes) get messages like the one below
> on my MDS, while running stress tests (bonnie++ on 4
> clients, and some unpacking, copying, diffing of large tar files on
> one other client). Again, I'm running Lustre 1.4.1 with kernel
> 2.6.12.5 + bugzilla patches with 2 OSTs over Gigabit Ethernet. Here is
> my configuration:

These errors are actually rather harmless.  They indicate that a lock
callback operation on the MDS is taking too long for some reason.

> lmc -m config.xml --format --add mds --node ha-beo-2 --mds mds-beo \
>     --fstype ldiskfs --dev /dev/drbd/2

I suspect the fact that the MDS is running atop drbd may be a contributing
factor to the MDS slowness.

> Aug 26 11:32:17 ha-beo-2 kernel: Lustre: 0:0:(watchdog.c:122:lcw_cb())
> Watchdog triggered for pid 25181: it was inactive for 1500us

Try changing ldlm/ldlm_lockd.c::ldlm_setup() to use "ldlm_timeout * 1000"
instead of "1500" where ldlm_cb_service is initialized via ptlrpc_init_svc().

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Aug 26, 2005 13:28 -0600, Andreas Dilger wrote:
> On Aug 26, 2005 11:40 +0200, Roland Fehrenbacher wrote:
> > I permanently (every couple of minutes) get messages like the one below
> > on my MDS, while running stress tests (bonnie++ on 4
> > clients, and some unpacking, copying, diffing of large tar files on
> > one other client). Again, I'm running Lustre 1.4.1 with kernel
> > 2.6.12.5 + bugzilla patches with 2 OSTs over Gigabit Ethernet. Here is
> > my configuration:
>
> These errors are actually rather harmless.  They indicate that a lock
> callback operation on the MDS is taking too long for some reason.
>
> > lmc -m config.xml --format --add mds --node ha-beo-2 --mds mds-beo \
> >     --fstype ldiskfs --dev /dev/drbd/2
>
> I suspect the fact that the MDS is running atop drbd may be a contributing
> factor to the MDS slowness.
>
> > Aug 26 11:32:17 ha-beo-2 kernel: Lustre: 0:0:(watchdog.c:122:lcw_cb())
> > Watchdog triggered for pid 25181: it was inactive for 1500us
>
> Try changing ldlm/ldlm_lockd.c::ldlm_setup() to use "ldlm_timeout * 1000"
> instead of "1500" where ldlm_cb_service is initialized via ptlrpc_init_svc().

Actually, in hindsight this is bug 5515, already fixed in 1.4.2.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Fri, 2005-08-26 at 11:40 +0200, Roland Fehrenbacher wrote:
> Hi,
>
> I permanently (every couple of minutes) get messages like the one below
> on my MDS, while running stress tests (bonnie++ on 4
> clients, and some unpacking, copying, diffing of large tar files on
> one other client). Again, I'm running Lustre 1.4.1 with kernel
> 2.6.12.5 + bugzilla patches with 2 OSTs over Gigabit Ethernet. Here is
> my configuration:

I have no idea what the problem is, but isn't an outdated release on the
latest mainline kernel the least-supported of all possible unsupported
configurations?

-jwb
Hi,

I permanently (every couple of minutes) get messages like the one below
on my MDS, while running stress tests (bonnie++ on 4
clients, and some unpacking, copying, diffing of large tar files on
one other client). Again, I'm running Lustre 1.4.1 with kernel
2.6.12.5 + bugzilla patches with 2 OSTs over Gigabit Ethernet. Here is
my configuration:

-------------------------------------------------------------------------
lmc -m config.xml --add net --node ha-beo-2 --nid ha-beo-i-2 --nettype tcp
lmc -m config.xml --add net --node sn-03-1 --nid sn-03-1-i --nettype tcp
lmc -m config.xml --add net --node sn-03-2 --nid sn-03-2-i --nettype tcp
lmc -m config.xml --add net --node client --nid '*' --nettype tcp

# MDS
lmc -m config.xml --format --add mds --node ha-beo-2 --mds mds-beo \
    --fstype ldiskfs --dev /dev/drbd/2

# OSS
lmc -m config.xml --add lov --lov lov-beo --mds mds-beo --stripe_sz 1048576 \
    --stripe_cnt 0 --stripe_pattern 0
lmc -m config.xml --add ost --node sn-03-1 --lov lov-beo --ost sn-03-1 \
    --failover --fstype ldiskfs --dev /dev/vgraid50/ost
lmc -m config.xml --add ost --node sn-03-2 --lov lov-beo --ost sn-03-2 \
    --failover --fstype ldiskfs --dev /dev/vgraid50/ost

# Clients
lmc -m config.xml --add mtpt --node client --path /l/1 \
    --mds mds-beo --lov lov-beo
-------------------------------------------------------------------------

------------------------- error message ---------------------------------------
Aug 26 11:32:17 ha-beo-2 kernel: Lustre: 0:0:(watchdog.c:122:lcw_cb()) Watchdog
triggered for pid 25181: it was inactive for 1500us
Aug 26 11:32:17 ha-beo-2 kernel: ldlm_cb_27  D 000206a1fe5da43f  0 25181  1  25182 25180 (L-TLB)
Aug 26 11:32:17 ha-beo-2 kernel: ffff810011eefc48 0000000000000046 ffff8100344a8599 000000738893b46b
Aug 26 11:32:17 ha-beo-2 kernel: ffff810011eefc48 000000737cc1e224 0000000100000000 0000000000000934
Aug 26 11:32:17 ha-beo-2 kernel: 000206a1fe5da43f ffff81007c50b800
Aug 26 11:32:17 ha-beo-2 kernel: Call Trace:<ffffffff88279d93>{:libcfs:portals_debug_msg+883}
    <ffffffff803de0bd>{__down_write+141}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8830bcea>{:obdclass:llog_cat_cancel_records+762}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8854e420>{:ptlrpc:llog_origin_handle_cancel+3984}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff88503f78>{:ptlrpc:ldlm_callback_handler+3768}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff88539cdb>{:ptlrpc:ptlrpc_server_handle_request+4011}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8853b871>{:ptlrpc:ptlrpc_main+2177} <ffffffff8012e0f0>{default_wake_function+0}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8853afe0>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff8853afe0>{:ptlrpc:ptlrpc_retry_rqbds+0}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8010e577>{child_rip+8} <ffffffff8853aff0>{:ptlrpc:ptlrpc_main+0}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8010e56f>{child_rip+0}
-------------------------------------------------------------------------------

Any idea what is causing this, and how serious it is?

Thanks for your help,

Roland