On Aug 26, 2005 11:40 +0200, Roland Fehrenbacher wrote:
> I permanently (every couple of minutes) get messages like the one below
> on my MDS, while running stress tests (bonnie++ on 4
> clients, and some unpacking, copying, diffing of large tar files on
> one other client). Again, I'm running Lustre 1.4.1 with kernel
> 2.6.12.5 + bugzilla patches with 2 OSTs over Gigabit Ethernet. Here is
> my configuration:

These errors are actually rather harmless.  They indicate that a lock
callback operation on the MDS is taking too long for some reason.

> lmc -m config.xml --format --add mds --node ha-beo-2 --mds mds-beo \
>     --fstype ldiskfs --dev /dev/drbd/2

I suspect the fact that the MDS is running atop drbd may be a contributing
factor to the MDS slowness.

> Aug 26 11:32:17 ha-beo-2 kernel: Lustre: 0:0:(watchdog.c:122:lcw_cb())
> Watchdog triggered for pid 25181: it was inactive for 1500us

Try changing ldlm/ldlm_lockd.c::ldlm_setup() to use "ldlm_timeout * 1000"
instead of "1500" where ldlm_cb_service is initialized via ptlrpc_init_svc().

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Aug 26, 2005 13:28 -0600, Andreas Dilger wrote:
> On Aug 26, 2005 11:40 +0200, Roland Fehrenbacher wrote:
> > I permanently (every couple of minutes) get messages like the one below
> > on my MDS, while running stress tests (bonnie++ on 4
> > clients, and some unpacking, copying, diffing of large tar files on
> > one other client). Again, I'm running Lustre 1.4.1 with kernel
> > 2.6.12.5 + bugzilla patches with 2 OSTs over Gigabit Ethernet. Here is
> > my configuration:
>
> These errors are actually rather harmless.  They indicate that a lock
> callback operation on the MDS is taking too long for some reason.
>
> > lmc -m config.xml --format --add mds --node ha-beo-2 --mds mds-beo \
> >     --fstype ldiskfs --dev /dev/drbd/2
>
> I suspect the fact that the MDS is running atop drbd may be a contributing
> factor to the MDS slowness.
>
> > Aug 26 11:32:17 ha-beo-2 kernel: Lustre: 0:0:(watchdog.c:122:lcw_cb())
> > Watchdog triggered for pid 25181: it was inactive for 1500us
>
> Try changing ldlm/ldlm_lockd.c::ldlm_setup() to use "ldlm_timeout * 1000"
> instead of "1500" where ldlm_cb_service is initialized via ptlrpc_init_svc().

Actually, in hindsight this is bug 5515, already fixed in 1.4.2.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Fri, 2005-08-26 at 11:40 +0200, Roland Fehrenbacher wrote:
> Hi,
>
> I permanently (every couple of minutes) get messages like the one below
> on my MDS, while running stress tests (bonnie++ on 4
> clients, and some unpacking, copying, diffing of large tar files on
> one other client). Again, I'm running Lustre 1.4.1 with kernel
> 2.6.12.5 + bugzilla patches with 2 OSTs over Gigabit Ethernet. Here is
> my configuration:

I have no idea what the problem is, but isn't an outdated release on the
latest mainline kernel the least-supported of all possible unsupported
configurations?

-jwb
Hi,

I permanently (every couple of minutes) get messages like the one below
on my MDS, while running stress tests (bonnie++ on 4
clients, and some unpacking, copying, diffing of large tar files on
one other client). Again, I'm running Lustre 1.4.1 with kernel
2.6.12.5 + bugzilla patches with 2 OSTs over Gigabit Ethernet. Here is
my configuration:

-------------------------------------------------------------------------
lmc -m config.xml --add net --node ha-beo-2 --nid ha-beo-i-2 --nettype tcp
lmc -m config.xml --add net --node sn-03-1 --nid sn-03-1-i --nettype tcp
lmc -m config.xml --add net --node sn-03-2 --nid sn-03-2-i --nettype tcp
lmc -m config.xml --add net --node client --nid '*' --nettype tcp

# MDS
lmc -m config.xml --format --add mds --node ha-beo-2 --mds mds-beo \
    --fstype ldiskfs --dev /dev/drbd/2

# OSS
lmc -m config.xml --add lov --lov lov-beo --mds mds-beo --stripe_sz 1048576 \
    --stripe_cnt 0 --stripe_pattern 0
lmc -m config.xml --add ost --node sn-03-1 --lov lov-beo --ost sn-03-1 \
    --failover --fstype ldiskfs --dev /dev/vgraid50/ost
lmc -m config.xml --add ost --node sn-03-2 --lov lov-beo --ost sn-03-2 \
    --failover --fstype ldiskfs --dev /dev/vgraid50/ost

# Clients
lmc -m config.xml --add mtpt --node client --path /l/1 \
    --mds mds-beo --lov lov-beo
-------------------------------------------------------------------------

------------------------- error message ---------------------------------------
Aug 26 11:32:17 ha-beo-2 kernel: Lustre: 0:0:(watchdog.c:122:lcw_cb()) Watchdog
triggered for pid 25181: it was inactive for 1500us
Aug 26 11:32:17 ha-beo-2 kernel: ldlm_cb_27  D 000206a1fe5da43f  0 25181  1  25182 25180 (L-TLB)
Aug 26 11:32:17 ha-beo-2 kernel: ffff810011eefc48 0000000000000046 ffff8100344a8599 000000738893b46b
Aug 26 11:32:17 ha-beo-2 kernel: ffff810011eefc48 000000737cc1e224 0000000100000000 0000000000000934
Aug 26 11:32:17 ha-beo-2 kernel: 000206a1fe5da43f ffff81007c50b800
Aug 26 11:32:17 ha-beo-2 kernel: Call Trace:<ffffffff88279d93>{:libcfs:portals_debug_msg+883}
    <ffffffff803de0bd>{__down_write+141}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8830bcea>{:obdclass:llog_cat_cancel_records+762}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8854e420>{:ptlrpc:llog_origin_handle_cancel+3984}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff88503f78>{:ptlrpc:ldlm_callback_handler+3768}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff88539cdb>{:ptlrpc:ptlrpc_server_handle_request+4011}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8853b871>{:ptlrpc:ptlrpc_main+2177} <ffffffff8012e0f0>{default_wake_function+0}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8853afe0>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff8853afe0>{:ptlrpc:ptlrpc_retry_rqbds+0}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8010e577>{child_rip+8} <ffffffff8853aff0>{:ptlrpc:ptlrpc_main+0}
Aug 26 11:32:17 ha-beo-2 kernel: <ffffffff8010e56f>{child_rip+0}
-------------------------------------------------------------------------------

Any idea what is causing this, and how serious it is?

Thanks for your help,

Roland