On Jun 03, 2005 09:34 -0300, Leandro Tavares Carneiro wrote:
> I'm new to Lustre and I'm evaluating it for use in our production
> environment. I have installed Lustre 1.2.4 on 8 nodes, 1 MDS and 7 OSSs, and
> everything works fine.
>
> My next step is to mount this filesystem from the little cluster above on
> another cluster with 128 dual Opteron machines. I installed Lustre on one
> node to test the client functionality, and the local tests ran well,
> confirming the software is working.
>
> Call Trace: [<ffffffff802bcca8>]{sprintf+136}
> [<ffffffffa0133cb9>]{:obdclass:class_process_config+297}
> [<ffffffffa013889c>]{:obdclass:class_config_dump_llog+3660}
> [<ffffffffa00de8e7>]{:portals:kportal_nal_cmd+519}
> [<ffffffffa013574f>]{:obdclass:class_config_llog_handler+1679}
> [<ffffffffa02861c3>]{:ptlrpc:llog_client_next_block+1795}
> [<ffffffffa01033c2>]{:obdclass:llog_process+3122}
>
> I am using RedHat WS 3 update 4 on all of these nodes. The server nodes are
> dual PIII 1.4GHz and are used only as Lustre OSS and MDS nodes. The client
> is a dual Opteron 244.

The 1.2.4 version of Lustre does not support "zeroconf" mounting across
systems with different word sizes (i.e. i686 and x86_64), which appears
to be the problem here. If you instead mount the clients with
"lconf --node client {config}.xml" (assuming your generic client config is
called "client", as it is in most sample configs) then it should work.
This is fixed in the 1.4.2 release.

If you are seriously considering a Lustre evaluation for your company, you
should contact sales@clusterfs.com to get an evaluation of the 1.4.2
Lustre code.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
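For reference, the two client mount styles being contrasted here look roughly
like the following sketch. The hostname, MDS name, client profile and mount
point are taken from the configuration posted later in this thread, and the
exact zeroconf syntax may vary between 1.2.x and 1.4.x releases:

# "zeroconf" mount - the style that oopses on a 64-bit client against
# 32-bit servers in 1.2.4:
#   mount -t lustre <mds-host>:/<mds-name>/<client-profile> <mountpoint>
mount -t lustre bw3n25:/mds1/client /miglustre

# lconf-based mount suggested above, which avoids the word-size problem:
lconf --node client miglustre.xml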
Andreas,

Thank you for your help. To evaluate the commercial version, it first needs to
work with the public version... that is the condition my boss put on me for
evaluating the commercial version.

Now, using lconf to mount the filesystem, it gives me these messages:

Lustre: 2407:0:(socknal.c:1631:ksocknal_module_init()) maximum lustre stack 16283
Lustre: 2407:0:(socknal.c:130:ksocknal_init()) maximum lustre stack 16380
Lustre: 2407:0:(lib-init.c:257:lib_init()) maximum lustre stack 16384
LustreError: 2441:0:(client.c:452:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -16
  req@00000100f872ec00 x15/t0 o38->mds1_UUID@NID_bw3n25_UUID:12 lens 168/64 ref 1 fl Rpc:R/0/50000 rc 0/-16
LustreError: 2441:0:(client.c:452:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -16
  req@000001007f708400 x17/t0 o38->mds1_UUID@NID_bw3n25_UUID:12 lens 168/64 ref 1 fl Rpc:R/0/50000 rc 0/-16
LustreError: 2441:0:(client.c:452:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -16
  req@00000100f872e400 x32/t0 o38->mds1_UUID@NID_bw3n25_UUID:12 lens 168/64 ref 1 fl Rpc:R/0/50000 rc 0/-16
LustreError: 2441:0:(client.c:452:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -16
  req@000001007d0b2c00 x34/t0 o38->mds1_UUID@NID_bw3n25_UUID:12 lens 168/64 ref 1 fl Rpc:R/0/50000 rc 0/-16

And after some time it is mounted. The filesystem server messages are:

Jun 3 15:25:24 bw3n29 acceptor[2260]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:25:24 bw3n27 acceptor[2194]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:25:24 bw3n28 acceptor[2263]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:25:24 bw3n30 acceptor[2260]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:25:24 bw3n32 acceptor[2260]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:25:24 bw3n31 acceptor[2260]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:25:24 bw3n26 acceptor[2263]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:25:24 bw3n25 acceptor[2224]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:25:24 bw3n25 kernel: Lustre: 2227:0:(ldlm_lib.c:752:target_start_recovery_timer()) mds1: starting recovery timer (250s)
Jun 3 15:25:24 bw3n25 kernel: LustreError: 2227:0:(ldlm_lib.c:470:target_handle_connect()) denying connection for new client aa896_MNT_client_391951a146: 1 clients in recovery for 250s
Jun 3 15:25:24 bw3n25 kernel: LustreError: 2227:0:(ldlm_lib.c:1050:target_send_reply_msg()) @@@ processing error (-16) req@f7829800 x15/t0 o38-><?>@<?>:-1 lens 168/64 ref 0 fl ?phase?:/0/50000 rc -16/0
Jun 3 15:25:24 bw3n25 acceptor[2224]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:27:04 bw3n25 acceptor[2224]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:27:04 bw3n25 kernel: LustreError: 2228:0:(ldlm_lib.c:470:target_handle_connect()) denying connection for new client aa896_MNT_client_391951a146: 1 clients in recovery for 150s
Jun 3 15:27:04 bw3n25 kernel: LustreError: 2228:0:(ldlm_lib.c:1050:target_send_reply_msg()) @@@ processing error (-16) req@f64a4400 x17/t0 o38-><?>@<?>:-1 lens 168/64 ref 0 fl ?phase?:/0/50000 rc -16/0
Jun 3 15:27:50 bw3n25 kernel: Lustre: 2196:0:(socknal_cb.c:1513:ksocknal_process_receive()) [f64b6000] EOF from 0xa027480 ip 10.2.116.128:32793
Jun 3 15:27:50 bw3n25 kernel: Lustre: 2195:0:(socknal_cb.c:1513:ksocknal_process_receive()) [f64b6800] EOF from 0xa027480 ip 10.2.116.128:32792
Jun 3 15:27:50 bw3n25 acceptor[2224]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:27:50 bw3n25 kernel: LustreError: 2229:0:(ldlm_lib.c:470:target_handle_connect()) denying connection for new client eed7a_MNT_client_ad7be1467d: 1 clients in recovery for 103s
Jun 3 15:27:50 bw3n25 kernel: LustreError: 2229:0:(ldlm_lib.c:1050:target_send_reply_msg()) @@@ processing error (-16) req@f6485a00 x32/t0 o38-><?>@<?>:-1 lens 168/64 ref 0 fl ?phase?:/0/50000 rc -16/0
Jun 3 15:27:50 bw3n25 acceptor[2224]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:29:30 bw3n25 acceptor[2224]: Accepted host: bw6n128.ep.petrobras.com.br snd: 16777216 rcv 16777216 nagle: disabled
Jun 3 15:29:30 bw3n25 kernel: LustreError: 2230:0:(ldlm_lib.c:470:target_handle_connect()) denying connection for new client eed7a_MNT_client_ad7be1467d: 1 clients in recovery for 3s
Jun 3 15:29:30 bw3n25 kernel: LustreError: 2230:0:(ldlm_lib.c:1050:target_send_reply_msg()) @@@ processing error (-16) req@f6e7d400 x34/t0 o38-><?>@<?>:-1 lens 168/64 ref 0 fl ?phase?:/0/50000 rc -16/0
Jun 3 15:29:34 bw3n25 kernel: LustreError: 0:0:(ldlm_lib.c:713:target_recovery_expired()) recovery timed out, aborting
Jun 3 15:30:01 bw3n26 last message repeated 2 times
Jun 3 15:31:10 bw3n25 kernel: LustreError: 2231:0:(ldlm_lib.c:701:target_abort_recovery()) mds1: recovery period over; disconnecting unfinished clients.
Jun 3 15:31:10 bw3n25 kernel: LustreError: 2231:0:(genops.c:689:class_disconnect_stale_exports()) mds1: disconnecting 1 stale clients
Jun 3 15:31:10 bw3n25 kernel: Lustre: 2231:0:(ldlm_lib.c:596:target_finish_recovery()) mds1: sending delayed replies to recovered clients
Jun 3 15:31:10 bw3n25 kernel: Lustre: 2231:0:(ldlm_lib.c:605:target_finish_recovery()) mds1: all clients recovered, 0 MDS orphans deleted
Jun 3 15:31:10 bw3n25 kernel: LustreError: 2231:0:(recover.c:68:ptlrpc_run_recovery_over_upcall()) Error invoking recovery upcall DEFAULT RECOVERY_OVER mds1_UUID: -2; check /proc/sys/lustre/upcall

I think I have done something wrong. I wrote the script based on the example
in the Lustre HOWTO, and it works very well when I mount the filesystem on the
MDS node.
Below is the script I have made:

# config.sh
rm -f miglustre.xml

# Create nodes
lmc -m miglustre.xml --add net --node bw3n25 --nid bw3n25 --nettype tcp
lmc -m miglustre.xml --add net --node bw3n26 --nid bw3n26 --nettype tcp
lmc -m miglustre.xml --add net --node bw3n27 --nid bw3n27 --nettype tcp
lmc -m miglustre.xml --add net --node bw3n28 --nid bw3n28 --nettype tcp
lmc -m miglustre.xml --add net --node bw3n29 --nid bw3n29 --nettype tcp
lmc -m miglustre.xml --add net --node bw3n30 --nid bw3n30 --nettype tcp
lmc -m miglustre.xml --add net --node bw3n31 --nid bw3n31 --nettype tcp
lmc -m miglustre.xml --add net --node bw3n32 --nid bw3n32 --nettype tcp
lmc -m miglustre.xml --add net --node client --nid '*' --nettype tcp

# Configure MDS
lmc -m miglustre.xml --add mds --node bw3n25 --mds mds1 --fstype ext3 --dev /dev/md1

# Configure OSTs
lmc -m miglustre.xml --add lov --lov lov1 --mds mds1 --stripe_sz 16777216 --stripe_cnt 0 --stripe_pattern 0
lmc -m miglustre.xml --add ost --node bw3n26 --lov lov1 --ost ost1 --fstype ext3 --dev /dev/md1
lmc -m miglustre.xml --add ost --node bw3n27 --lov lov1 --ost ost2 --fstype ext3 --dev /dev/md1
lmc -m miglustre.xml --add ost --node bw3n28 --lov lov1 --ost ost3 --fstype ext3 --dev /dev/md1
lmc -m miglustre.xml --add ost --node bw3n29 --lov lov1 --ost ost4 --fstype ext3 --dev /dev/md1
lmc -m miglustre.xml --add ost --node bw3n30 --lov lov1 --ost ost5 --fstype ext3 --dev /dev/md1
lmc -m miglustre.xml --add ost --node bw3n31 --lov lov1 --ost ost6 --fstype ext3 --dev /dev/md1
lmc -m miglustre.xml --add ost --node bw3n32 --lov lov1 --ost ost7 --fstype ext3 --dev /dev/md1

# Configure client (this is a 'generic' client used for all client mounts)
lmc -m miglustre.xml --add mtpt --node client --path /miglustre --mds mds1 --lov lov1

Once it is working mostly without problems, I will put it under a hard test.
After that, if I get good results, I will ask for a quote and evaluate the
commercial version.

Thanks for your help,

Regards,

Leandro Tavares Carneiro
Petrobras TI/TI-E&P/STEP Suporte Tecnico de E&P
Av Chile, 65 sala 1501 EDISE - Rio de Janeiro / RJ
Tel: (0xx21) 3224-1427
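For reference, a configuration like the one above is normally brought up with
lconf on each node. A rough sketch, assuming miglustre.xml has been copied to
every node (note that --reformat erases and reformats the backing devices, so
it is only used for the initial setup):

# On the MDS node:
lconf --reformat --node bw3n25 miglustre.xml

# On each OST node (bw3n26 .. bw3n32), substituting its own node name:
lconf --reformat --node bw3n26 miglustre.xml

# On each client, using the generic "client" profile as Andreas suggested:
lconf --node client miglustre.xml

# To stop a node later:
# lconf --cleanup --node <nodename> miglustre.xml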
On Fri, 2005-06-03 at 15:41 -0300, Leandro Tavares Carneiro wrote:
> Andreas,
>
> Thank you for your help. To evaluate the commercial version, it first needs
> to work with the public version... that is the condition my boss put on me
> for evaluating the commercial version.
>
> Now, using lconf to mount the filesystem, it gives me these messages:

> And after some time it is mounted. The filesystem server messages are:

> Jun 3 15:25:24 bw3n25 kernel: Lustre:
> 2227:0:(ldlm_lib.c:752:target_start_recovery_timer()) mds1: starting recovery
> timer (250s)

This means you will be waiting at least 4 minutes, 10 seconds, while
Lustre waits for missing nodes to establish contact.

> Jun 3 15:25:24 bw3n25 kernel: LustreError:
> 2227:0:(ldlm_lib.c:470:target_handle_connect()) denying connection for new
> client aa896_MNT_client_391951a146: 1 clients in recovery for 250s

In the meantime, new nodes will be blocked from connecting.

> 2228:0:(ldlm_lib.c:470:target_handle_connect()) denying connection for new
> client aa896_MNT_client_391951a146: 1 clients in recovery for 150s

Still waiting ....

> 2229:0:(ldlm_lib.c:470:target_handle_connect()) denying connection for new
> client eed7a_MNT_client_ad7be1467d: 1 clients in recovery for 103s

Still waiting ....

> 2230:0:(ldlm_lib.c:470:target_handle_connect()) denying connection for new
> client eed7a_MNT_client_ad7be1467d: 1 clients in recovery for 3s

Almost there....

> Jun 3 15:29:30 bw3n25 kernel: LustreError:
> 2230:0:(ldlm_lib.c:1050:target_send_reply_msg()) @@@ processing error (-16)
> req@f6e7d400 x34/t0 o38-><?>@<?>:-1 lens 168/64 ref 0 fl ?phase?:/0/50000 rc -16/0
> Jun 3 15:29:34 bw3n25 kernel: LustreError:
> 0:0:(ldlm_lib.c:713:target_recovery_expired()) recovery timed out, aborting

Any node that hasn't checked in yet is now forgotten:

> 2231:0:(ldlm_lib.c:701:target_abort_recovery()) mds1: recovery period over;
> disconnecting unfinished clients.
> Jun 3 15:31:10 bw3n25 kernel: LustreError:
> 2231:0:(genops.c:689:class_disconnect_stale_exports()) mds1: disconnecting 1
> stale clients

Including this anonymous node.

> Jun 3 15:31:10 bw3n25 kernel: Lustre:
> 2231:0:(ldlm_lib.c:596:target_finish_recovery()) mds1: sending delayed replies
> to recovered clients

But the other nodes are still fine.

> Jun 3 15:31:10 bw3n25 kernel: Lustre:
> 2231:0:(ldlm_lib.c:605:target_finish_recovery()) mds1: all clients recovered, 0
> MDS orphans deleted

No problem.

> Jun 3 15:31:10 bw3n25 kernel: LustreError:
> 2231:0:(recover.c:68:ptlrpc_run_recovery_over_upcall()) Error invoking recovery
> upcall DEFAULT RECOVERY_OVER mds1_UUID: -2; check /proc/sys/lustre/upcall

This is harmless.

> I think I have done something wrong.

No, I think you did it properly. It's just that Lustre is very verbose,
and it's difficult to tell what's wrong in all that debugging
information.

ERROR: SUCCESS!

-jwb
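As a side note on that last "harmless" error: -2 is ENOENT, i.e. the MDS tried
to run its recovery-over upcall and no script named DEFAULT exists. You can
confirm this on the MDS via the proc file named in the log message itself
(a sketch; exact behaviour may vary between releases):

# On the MDS, show which recovery upcall is configured:
cat /proc/sys/lustre/upcall
# A value of DEFAULT with no upcall script installed just produces the
# -2 (ENOENT) message after recovery completes; the filesystem itself is
# unaffected.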
Well, it is working. Now I can start more serious tests.

Thank you all for your help!

Leandro Tavares Carneiro
Petrobras TI/TI-E&P/STEP Suporte Tecnico de E&P
Av Chile, 65 sala 1501 EDISE - Rio de Janeiro / RJ
Tel: (0xx21) 3224-1427
Hi,
I'm new to Lustre and I'm evaluating it for use in our production
environment. I have installed Lustre 1.2.4 on 8 nodes, 1 MDS and 7 OSSs, and
everything works fine.

My next step is to mount this filesystem from the little cluster above on
another cluster with 128 dual Opteron machines. I installed Lustre on one
node to test the client functionality, and the local tests ran well,
confirming the software is working.
The problem begins when I mount the filesystem created on the small cluster on
this client node: I get a kernel crash and errors from Lustre. I couldn't find
a log with the Lustre errors at the moment, but the kernel message is below.
Lustre: 9807:0:(module.c:724:init_kportals_module()) maximum lustre stack 16384
Unable to handle kernel paging request at virtual address 0000010103d66960
printing rip:
ffffffffa012e8b8
PML4 8063 PGD 0
Oops: 0000
CPU 1
Pid: 9806, comm: mount.lustre Not tainted
RIP: 0010:[<ffffffffa012e8b8>]{:obdclass:class_attach+440}
RSP: 0000:00000100f3b3f7f8 EFLAGS: 00010202
RAX: 0000000009472d58 RBX: 00000100fa8f3b80 RCX: 0000000000000012
RDX: 0000000000003fee RSI: ffffffffffffffee RDI: 00000100fa8f3b80
RBP: 00000100fa8f3c08 R08: 0000000000000001 R09: 0000000000000000
R10: 000001007957bb00 R11: 0000000000000012 R12: 00000100f3b3fd08
R13: 0000000000000000 R14: 00000000000000a0 R15: 00000100fa8f3be8
FS: 0000002a95c6b4c0(0000) GS:ffffffff805e5c00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000010103d66960 CR3: 0000000007dc1000 CR4: 00000000000006e0
Call Trace: [<ffffffff802bcca8>]{sprintf+136}
[<ffffffffa0133cb9>]{:obdclass:class_process_config+297}
[<ffffffffa013889c>]{:obdclass:class_config_dump_llog+3660}
[<ffffffffa00de8e7>]{:portals:kportal_nal_cmd+519}
[<ffffffffa013574f>]{:obdclass:class_config_llog_handler+1679}
[<ffffffffa02861c3>]{:ptlrpc:llog_client_next_block+1795}
[<ffffffffa01033c2>]{:obdclass:llog_process+3122}
[<ffffffffa0286a35>]{:ptlrpc:llog_client_read_header+1909}
[<ffffffffa0148dc0>]{:obdclass:obd_dev+736}
[<ffffffffa02bad20>]{:ptlrpc:llog_client_ops+0}
[<ffffffffa01350c0>]{:obdclass:class_config_llog_handler+0}
[<ffffffffa02bad20>]{:ptlrpc:llog_client_ops+0}
[<ffffffffa01362e0>]{:obdclass:class_config_parse_llog+1648}
[<ffffffffa0148ae0>]{:obdclass:obd_dev+0}
[<ffffffffa011d798>]{:obdclass:class_conn2export+1256}
[<ffffffffa0148ae0>]{:obdclass:obd_dev+0}
[<ffffffffa03482b1>]{:llite:lustre_process_log+3761}
[<ffffffff802bcca8>]{sprintf+136}
[<ffffffffa036ee8c>]{:llite:.rodata.str1.1+3916}
[<ffffffffa036ee94>]{:llite:.rodata.str1.1+3924}
[<ffffffffa0349795>]{:llite:lustre_fill_super+3781}
[<ffffffffa036ee5a>]{:llite:.rodata.str1.1+3866}
[<ffffffff80127cfb>]{release_task+763}
[<ffffffff80129854>]{wait_task_zombie+372}
[<ffffffff80129d5f>]{sys_wait4+799}
[<ffffffffa036db0e>]{:llite:lustre_read_super+238}
[<ffffffffa037aa60>]{:llite:lustre_fs_type+0}
[<ffffffffa037aa60>]{:llite:lustre_fs_type+0}
[<ffffffff801673fe>]{get_sb_nodev+78}
[<ffffffffa037aa60>]{:llite:lustre_fs_type+0}
[<ffffffff801675f4>]{do_kern_mount+164}
[<ffffffff8017f181>]{do_add_mount+161}
[<ffffffff8017f4c3>]{do_mount+371}
[<ffffffff8017f8e5>]{sys_mount+197}
[<ffffffff801102a7>]{system_call+119}
Process mount.lustre (pid: 9806, stackpage=100f3b3d000)
Stack: 00000100f3b3f7f8 0000000000000000 00000000000000a0 0000000000000000
ffffffff802bcca8 0000003000000020 00000100f3b3f8e8 00000100f3b3f828
0000000100000001 0000000000000202 00000100f3a7c080 00000100fa8f3be8
00000100fa8f3b80 0000000000000012 00000100f3b3fd08 0000000000000000
00000000000000a0 00000100fa8f3be8 ffffffffa0133cb9 00000000000000a0
ffffffffa013889c 00000100f3a7c080 ffffffffa00de8e7 0000000000000001
0000000000000202 0000010087d681c0 00000100f3a7c0e0 00000100f3a7c0f0
0000000000000246 0000000000000246 00000100fa8f3b80 0000000000000012
ffffffffa013574f 0000010037f44048 ffffffffa02861c3 0000000000000000
00000100f3b3c000 0000000000000000 0000001200000000 000001007957bb00
Call Trace: [<ffffffff802bcca8>]{sprintf+136}
[<ffffffffa0133cb9>]{:obdclass:class_process_config+297}
[<ffffffffa013889c>]{:obdclass:class_config_dump_llog+3660}
[<ffffffffa00de8e7>]{:portals:kportal_nal_cmd+519}
[<ffffffffa013574f>]{:obdclass:class_config_llog_handler+1679}
[<ffffffffa02861c3>]{:ptlrpc:llog_client_next_block+1795}
[<ffffffffa01033c2>]{:obdclass:llog_process+3122}
[<ffffffffa0286a35>]{:ptlrpc:llog_client_read_header+1909}
[<ffffffffa0148dc0>]{:obdclass:obd_dev+736}
[<ffffffffa02bad20>]{:ptlrpc:llog_client_ops+0}
[<ffffffffa01350c0>]{:obdclass:class_config_llog_handler+0}
[<ffffffffa02bad20>]{:ptlrpc:llog_client_ops+0}
[<ffffffffa01362e0>]{:obdclass:class_config_parse_llog+1648}
[<ffffffffa0148ae0>]{:obdclass:obd_dev+0}
[<ffffffffa011d798>]{:obdclass:class_conn2export+1256}
[<ffffffffa0148ae0>]{:obdclass:obd_dev+0}
[<ffffffffa03482b1>]{:llite:lustre_process_log+3761}
[<ffffffff802bcca8>]{sprintf+136}
[<ffffffffa036ee8c>]{:llite:.rodata.str1.1+3916}
[<ffffffffa036ee94>]{:llite:.rodata.str1.1+3924}
[<ffffffffa0349795>]{:llite:lustre_fill_super+3781}
[<ffffffffa036ee5a>]{:llite:.rodata.str1.1+3866}
[<ffffffff80127cfb>]{release_task+763}
[<ffffffff80129854>]{wait_task_zombie+372}
[<ffffffff80129d5f>]{sys_wait4+799}
[<ffffffffa036db0e>]{:llite:lustre_read_super+238}
[<ffffffffa037aa60>]{:llite:lustre_fs_type+0}
[<ffffffffa037aa60>]{:llite:lustre_fs_type+0}
[<ffffffff801673fe>]{get_sb_nodev+78}
[<ffffffffa037aa60>]{:llite:lustre_fs_type+0}
[<ffffffff801675f4>]{do_kern_mount+164}
[<ffffffff8017f181>]{do_add_mount+161}
[<ffffffff8017f4c3>]{do_mount+371}
[<ffffffff8017f8e5>]{sys_mount+197}
[<ffffffff801102a7>]{system_call+119}
Code: 80 3c 28 00 0f 84 7e 01 00 00 8b 0d a0 a5 fc ff 85 c9 0f 84
Kernel panic: Fatal exception
I am using RedHat WS 3 update 4 on all of these nodes. The server nodes are
dual PIII 1.4GHz machines and are used only as Lustre OSS and MDS nodes. The
client is a dual Opteron 244.
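A quick way to confirm the word-size mismatch that Andreas diagnoses earlier in
this thread (32-bit PIII servers versus the x86_64 Opteron client) is simply to
compare kernel architectures on a server and on the client:

# Run on a server node and on the client:
uname -m    # i686 on the 32-bit PIII servers, x86_64 on the Opteron 244 client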
Thanks in advance,
--
Leandro Tavares Carneiro
Petrobras TI/TI-E&P/STEP Suporte Tecnico de E&P
Av Chile, 65 sala 1501 EDISE - Rio de Janeiro / RJ
Tel: (0xx21) 3224-1427