thr3ads.net - Lustre discuss - [Lustre-discuss] question about __alloc_pages: 0-order allocation failed (gfp=0x20/0) [May 2006]


Andreas Dilger

2006-May-19 07:36 UTC

[Lustre-discuss] question about __alloc_pages: 0-order allocation failed (gfp=0x20/0)


On Oct 08, 2004  17:57 +0200, Martin Vogt wrote:
> No. These are separate machines.
> The error is new btw. Before that (1.4b) I could run 350 iterations
> without this bug. (Then I stopped the benchmark; 350 iterations
> are close to infinity :-)
> >>when I run IOZone in an endless loop I have one client which
> >>after 10 Iterations stops working.
> >>
> >>I get this error in dmesg:
> >>
> >>
> >>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
> >>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
> >>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
> >>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
> >>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
> >>
> >>
> >>All the clients are the same (RAM/CPU, etc...)
> >>After the next iteration the iozone binary is started, lustre re-imports
> >>its OSTs on the client with this error message and the binary runs.

Martin, can you please set "sysctl -w vm.vm_gfp_debug=1" on your system
(depends what kernel you are running) and send me the stack traces it
generates when getting the allocation failures.  That would give me an
idea of what is failing.  The lustre allocations should normally print
an error message to the syslog if they ever fail, so it would be good
to know what is going on here.  Thanks.
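
The failure message itself already carries some information. The following is a minimal sketch for decoding it, assuming the 2.4-era flag values listed in the table (recalled from include/linux/mm.h, not stated anywhere in this thread; they should be checked against the headers of the kernel actually running). The function name and the returned field names are hypothetical.

```python
import re

# ASSUMPTION: bit values as defined in a stock 2.4 include/linux/mm.h.
# Vendor or Lustre-patched kernels may differ; verify before relying on this.
GFP_FLAGS_2_4 = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_HIGHIO",
    0x100: "__GFP_FS",
}

FAIL_RE = re.compile(
    r"__alloc_pages: (\d+)-order allocation failed \(gfp=0x([0-9a-fA-F]+)/(\d+)\)"
)


def decode_failure(line):
    """Split one dmesg allocation-failure line into order and flag names."""
    m = FAIL_RE.search(line)
    if m is None:
        return None
    order = int(m.group(1))
    mask = int(m.group(2), 16)
    names = [name for bit, name in sorted(GFP_FLAGS_2_4.items()) if mask & bit]
    return {
        "order": order,
        "pages": 1 << order,               # an order-n request asks for 2**n pages
        "mask": mask,
        "flags": names,
        "can_sleep": bool(mask & 0x10),    # True only if __GFP_WAIT is set
    }


print(decode_failure("__alloc_pages: 0-order allocation failed (gfp=0x20/0)"))
```

Under those assumed values, gfp=0x20 is a single-page request that is not allowed to sleep, which is the kind of allocation made from interrupt context. The number after the slash is left uninterpreted here.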

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/



Martin Vogt

2006-May-19 07:36 UTC


[Lustre-discuss] Re: question about __alloc_pages: 0-order allocation failed (gfp=0x20/0)

On Oct 08, 2004  17:57 +0200, Martin Vogt wrote:
>> No. These are separate machines.
>> The error is new btw. Before that (1.4b) I could run 350 iterations
>> without this bug. (Then I stopped the benchmark; 350 iterations
>> are close to infinity :-)
>
>>>> when I run IOZone in an endless loop I have one client which
>>>> after 10 iterations stops working.
>>>>
>>>> I get this error in dmesg:
>>>>
>>>> __alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>>> __alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>>> __alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>>> __alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>>> __alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>>>
>>>> All the clients are the same (RAM/CPU, etc...)
>>>> After the next iteration the iozone binary is started, lustre re-imports
>>>> its OSTs on the client with this error message and the binary runs.
>
>Martin, can you please set "sysctl -w vm.vm_gfp_debug=1" on your system
>(depends what kernel you are running) and send me the stack traces it
>generates when getting the allocation failures.  That would give me an
>idea of what is failing.  The lustre allocations should normally print
>an error message to the syslog if they ever fail, so it would be good
>to know what is going on here.  Thanks.

Hello,

I piped the trace through ksymoops; the trace is at the end of the mail.

But the oops was not Lustre's fault (I think).

The trace shows that there are vmware modules loaded.
I had tested vmware on this machine, then reinstalled 1.3.2
without rebuilding the vmware kernel modules, and thus
was using the old ones. I think this leads to the oops.

After removing vmnet/vmmon, iozone runs again and there are no oopses
(15 iterations).

But I still have these errors:
>LustreError: 1661:0:(client.c:816:ptlrpc_expire_one_request()) @@@ timeout
>(sent at 1097573837) req@e7692800 >x910748/t657940
>o4->media-ost3_UUID@NID_192.168.9.13_UUID:6 lens 288/248 ref 3 fl
>?phase?:R/4/0 rc 0/0
After some time the upcall script is executed and the client continues.
Is it OK that I have these error messages on a regular basis?

I attached the oops; maybe the vmware modules trigger something.

regards,

Martin

media4:/usr/local/lustre # ksymoops </tmp/a.txt
ksymoops 2.4.9 on i686 2.4.24-lustrevogt.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.24-lustrevogt/ (default)
     -m /boot/System.map-2.4.24-lustrevogt (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

f48479bc f48479f8 c013aa22 000004ed c1010030 00000001 0000000c c02d4bfc 
       c02d4db4 00000000 00000020 00000001 00000246 c2a1bdb8 00000000 f4847a00 
       c013aa57 f4847a20 c0137896 c2a1bdc0 00000003 00000020 c2a1bdb8 00000246 
Call Trace:    [<c013aa22>] [<c013aa57>] [<c0137896>] [<c0137ddc>] [<c0227e4a>]
  [<f8a14112>] [<f8a13d89>] [<f8a14112>] [<f8a13d89>] [<f8a139c8>] [<c010aad3>]
  [<c010acf5>] [<c010d398>] [<c02280c4>] [<c0228230>] [<f8a9163a>] [<f8a90e65>]
  [<c0228230>] [<f8a93c73>] [<c022c20e>] [<c0237d62>] [<c022c45d>] [<c0246821>]
  [<c0246ac4>] [<c01379cf>] [<c025e8c2>] [<c0258ca7>] [<c025b4d0>] [<c0257630>]
  [<c010ad42>] [<c025f9cf>] [<c0260038>] [<c0243612>] [<c0243a32>] [<c022cc45>]
  [<c022cd1d>] [<c022cec1>] [<c01232f7>] [<c010ad42>] [<c010d398>] [<c026ca40>]
  [<c022499e>] [<f8b71a1a>] [<f8b71c2a>] [<f8b74bbe>] [<f8b75609>] [<c01075c3>]
  [<f8b75470>]
Warning (Oops_read): Code line not seen, dumping what data is available

Trace; c013aa22 <__alloc_pages+272/280>
Trace; c013aa57 <__get_free_pages+27/30>
Trace; c0137896 <kmem_cache_grow+c6/280>
Trace; c0137ddc <kmalloc+8c/160>
Trace; c0227e4a <alloc_skb+ba/1f0>
Trace; f8a14112 <[e1000]e1000_alloc_rx_buffers+a2/120>
Trace; f8a13d89 <[e1000]e1000_clean_rx_irq+169/450>
Trace; f8a14112 <[e1000]e1000_alloc_rx_buffers+a2/120>
Trace; f8a13d89 <[e1000]e1000_clean_rx_irq+169/450>
Trace; f8a139c8 <[e1000]e1000_intr+28/80>
Trace; c010aad3 <handle_IRQ_event+63/a0>
Trace; c010acf5 <do_IRQ+a5/100>
Trace; c010d398 <call_do_IRQ+5/d>
Trace; c02280c4 <kfree_skbmem+4/70>
Trace; c0228230 <__kfree_skb+100/150>
Trace; f8a9163a <[vmnet]VNetHubReceive+8c/98>
Trace; f8a90e65 <[vmnet]VNetSend+33/62>
Trace; c0228230 <__kfree_skb+100/150>
Trace; f8a93c73 <[vmnet]VNetBridgeReceiveFromDev+1b7/1c4>
Trace; c022c20e <dev_queue_xmit_nit+9e/f0>
Trace; c0237d62 <qdisc_restart+122/1a0>
Trace; c022c45d <dev_queue_xmit+16d/320>
Trace; c0246821 <ip_output+121/1c0>
Trace; c0246ac4 <ip_queue_xmit+204/570>
Trace; c01379cf <kmem_cache_grow+1ff/280>
Trace; c025e8c2 <tcp_v4_send_check+82/d0>
Trace; c0258ca7 <tcp_transmit_skb+3a7/5f0>
Trace; c025b4d0 <tcp_send_ack+80/c0>
Trace; c0257630 <tcp_rcv_established+720/850>
Trace; c010ad42 <do_IRQ+f2/100>
Trace; c025f9cf <tcp_v4_do_rcv+11f/130>
Trace; c0260038 <tcp_v4_rcv+658/720>
Trace; c0243612 <ip_local_deliver+1a2/1d0>
Trace; c0243a32 <ip_rcv+3f2/440>
Trace; c022cc45 <netif_receive_skb+1c5/1f0>
Trace; c022cd1d <process_backlog+ad/160>
Trace; c022cec1 <net_rx_action+f1/180>
Trace; c01232f7 <do_softirq+87/f0>
Trace; c010ad42 <do_IRQ+f2/100>
Trace; c010d398 <call_do_IRQ+5/d>
Trace; c026ca40 <inet_recvmsg+0/50>
Trace; c022499e <sock_recvmsg+2e/c0>
Trace; f8b71a1a <[ksocknal]ksocknal_recv_kiov+ea/250>
Trace; f8b71c2a <[ksocknal]ksocknal_receive+aa/280>
Trace; f8b74bbe <[ksocknal]ksocknal_process_receive+ce/780>
Trace; f8b75609 <[ksocknal]ksocknal_scheduler+199/640>
Trace; c01075c3 <arch_kernel_thread+23/30>
Trace; f8b75470 <[ksocknal]ksocknal_scheduler+0/640>

Andreas Dilger

2006-May-19 07:36 UTC


[Lustre-discuss] Re: question about __alloc_pages: 0-order allocation failed (gfp=0x20/0)


On Oct 12, 2004  12:17 +0200, Martin Vogt wrote:
> Andreas Dilger wrote:
> >Martin, can you please set "sysctl -w vm.vm_gfp_debug=1" on your system
> >(depends what kernel you are running) and send me the stack traces it
> >generates when getting the allocation failures.  That would give me an
> >idea of what is failing.  The lustre allocations should normally print
> >an error message to the syslog if they ever fail, so it would be good
> >to know what is going on here.  Thanks.
>
> I piped the trace through ksymoops; the trace is at the end of the mail.
>
> But the oops was not Lustre's fault (I think).
>
> The trace shows that there are vmware modules loaded.
> I had tested vmware on this machine, then reinstalled 1.3.2
> without rebuilding the vmware kernel modules, and thus
> was using the old ones. I think this leads to the oops.
>
> After removing vmnet/vmmon, iozone runs again and there are no oopses
> (15 iterations).

Just FYI, it isn't really an "oops", just a stack dump for the currently
running process when there is an allocation failure (this is what the
vm.vm_gfp_debug=1 setting does for us).  It tells us that the allocation
failure is coming from alloc_skb() in the e1000 receive path, though
I'm not sure why that is happening.  Do all of the stack traces look
similar?  This is sort of "statistical" debugging, since any allocation
could be the one that fails, but if you examine enough of them it is
likely that the one that happens the most would become clear.

It is interesting to hear that removing the vmware modules avoids
this problem.  Does vmware do anything with incoming network traffic
(e.g. filter it for forwarding to the virtual machine)?
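
The "statistical" approach described above can be sketched in a few lines: collect several ksymoops dumps, skip the generic allocator frames at the top of each, and count which caller shows up most often. This is only an illustration, not a tool used in this thread; the function names and the choice of which frames count as "allocator" frames are assumptions.

```python
import re
from collections import Counter

# Matches ksymoops output lines such as
#   Trace; c0227e4a <alloc_skb+ba/1f0>
#   Trace; f8a14112 <[e1000]e1000_alloc_rx_buffers+a2/120>
# Leading non-word characters are allowed so that "> "-quoted dumps also parse.
TRACE_RE = re.compile(
    r"^\W*Trace; [0-9a-f]+ <(?:\[(?P<mod>[^\]]+)\])?(?P<sym>[^+>]+)"
)

# ASSUMPTION: these frames are treated as generic allocator plumbing and skipped.
ALLOCATOR_FRAMES = {"__alloc_pages", "__get_free_pages", "kmem_cache_grow", "kmalloc"}


def first_caller(dump_text):
    """Return 'module:symbol' for the first non-allocator frame of one dump."""
    for line in dump_text.splitlines():
        m = TRACE_RE.match(line)
        if m and m.group("sym") not in ALLOCATOR_FRAMES:
            return f"{m.group('mod') or 'kernel'}:{m.group('sym')}"
    return None


def tally_callers(dumps):
    """Count how often each caller is responsible for a failed allocation."""
    return Counter(c for c in map(first_caller, dumps) if c is not None)


sample = """Trace; c013aa22 <__alloc_pages+272/280>
Trace; c013aa57 <__get_free_pages+27/30>
Trace; c0137896 <kmem_cache_grow+c6/280>
Trace; c0137ddc <kmalloc+8c/160>
Trace; c0227e4a <alloc_skb+ba/1f0>
Trace; f8a14112 <[e1000]e1000_alloc_rx_buffers+a2/120>"""

print(tally_callers([sample, sample]).most_common())
```

Feeding every captured dump through `tally_callers` gives a ranked list, so a single dominant caller (here `alloc_skb`) stands out even if individual dumps differ further down the stack.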

> But I have still these errors:
>
> >LustreError: 1661:0:(client.c:816:ptlrpc_expire_one_request()) @@@
> >timeout (sent at 1097573837) req@e7692800 >x910748/t657940
> >o4->media-ost3_UUID@NID_192.168.9.13_UUID:6 lens 288/248 ref 3 fl
> >?phase?:R/4/0 rc 0/0
>
> After some time the upcall script is executed and the client continues.
> Is it ok that I have these error messages on a regular basis?

It's not really OK.  It means that your client is timing out in its
RPCs to the OST, possibly because these allocation failures are
causing the OST to drop incoming messages.  If the OSTs are running
in "recoverable" mode (configured with --failover in the lmc script)
then the clients will resend these RPCs.
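
To see how regularly these timeouts occur and which OST they point at, the quoted LustreError line can be pulled apart with a small parser. This is a minimal sketch: the field layout is inferred only from the one log line quoted in this thread, and the function and key names are hypothetical.

```python
import re
from datetime import datetime, timezone

# ASSUMPTION: layout inferred from the single quoted log line; other Lustre
# versions may format this message differently.
EXPIRE_RE = re.compile(
    r"LustreError: (?P<pid>\d+):.*?ptlrpc_expire_one_request\(\)\).*?"
    r"timeout\s+\(sent at (?P<sent>\d+)\).*?"
    r"->(?P<target>[^@\s]+)@(?P<nid>[^:\s]+)",
    re.DOTALL,  # the message is often wrapped across several lines
)


def parse_expired_request(text):
    """Extract pid, send time (UTC) and target from one timeout message."""
    m = EXPIRE_RE.search(text)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "sent_utc": datetime.fromtimestamp(int(m.group("sent")), tz=timezone.utc),
        "target": m.group("target"),
        "nid": m.group("nid"),
    }


line = (
    "LustreError: 1661:0:(client.c:816:ptlrpc_expire_one_request()) @@@ timeout "
    "(sent at 1097573837) req@e7692800 x910748/t657940 "
    "o4->media-ost3_UUID@NID_192.168.9.13_UUID:6 lens 288/248 ref 3 fl "
    "?phase?:R/4/0 rc 0/0"
)
info = parse_expired_request(line)
print(info["sent_utc"].isoformat(), info["target"], info["nid"])
```

Grouping the parsed records by `target` and by time would show whether the timeouts cluster on one OST or follow the benchmark's load pattern.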
> Trace; c013aa22 <__alloc_pages+272/280>
> Trace; c013aa57 <__get_free_pages+27/30>
> Trace; c0137896 <kmem_cache_grow+c6/280>
> Trace; c0137ddc <kmalloc+8c/160>
> Trace; c0227e4a <alloc_skb+ba/1f0>
> Trace; f8a14112 <[e1000]e1000_alloc_rx_buffers+a2/120>
> Trace; f8a13d89 <[e1000]e1000_clean_rx_irq+169/450>
> Trace; f8a14112 <[e1000]e1000_alloc_rx_buffers+a2/120>
> Trace; f8a13d89 <[e1000]e1000_clean_rx_irq+169/450>
> Trace; f8a139c8 <[e1000]e1000_intr+28/80>
> Trace; c010aad3 <handle_IRQ_event+63/a0>
> Trace; c010acf5 <do_IRQ+a5/100>
> Trace; c010d398 <call_do_IRQ+5/d>
> Trace; c02280c4 <kfree_skbmem+4/70>
> Trace; c0228230 <__kfree_skb+100/150>
> Trace; f8a9163a <[vmnet]VNetHubReceive+8c/98>
> Trace; f8a90e65 <[vmnet]VNetSend+33/62>
> Trace; c0228230 <__kfree_skb+100/150>
> Trace; f8a93c73 <[vmnet]VNetBridgeReceiveFromDev+1b7/1c4>
> Trace; c022c20e <dev_queue_xmit_nit+9e/f0>
> Trace; c0237d62 <qdisc_restart+122/1a0>
> Trace; c022c45d <dev_queue_xmit+16d/320>
> Trace; c0246821 <ip_output+121/1c0>
> Trace; c0246ac4 <ip_queue_xmit+204/570>
> Trace; c01379cf <kmem_cache_grow+1ff/280>
> Trace; c025e8c2 <tcp_v4_send_check+82/d0>
> Trace; c0258ca7 <tcp_transmit_skb+3a7/5f0>
> Trace; c025b4d0 <tcp_send_ack+80/c0>
> Trace; c0257630 <tcp_rcv_established+720/850>
> Trace; c010ad42 <do_IRQ+f2/100>
> Trace; c025f9cf <tcp_v4_do_rcv+11f/130>
> Trace; c0260038 <tcp_v4_rcv+658/720>
> Trace; c0243612 <ip_local_deliver+1a2/1d0>
> Trace; c0243a32 <ip_rcv+3f2/440>
> Trace; c022cc45 <netif_receive_skb+1c5/1f0>
> Trace; c022cd1d <process_backlog+ad/160>
> Trace; c022cec1 <net_rx_action+f1/180>
> Trace; c01232f7 <do_softirq+87/f0>
> Trace; c010ad42 <do_IRQ+f2/100>
> Trace; c010d398 <call_do_IRQ+5/d>
> Trace; c026ca40 <inet_recvmsg+0/50>
> Trace; c022499e <sock_recvmsg+2e/c0>
> Trace; f8b71a1a <[ksocknal]ksocknal_recv_kiov+ea/250>
> Trace; f8b71c2a <[ksocknal]ksocknal_receive+aa/280>
> Trace; f8b74bbe <[ksocknal]ksocknal_process_receive+ce/780>
> Trace; f8b75609 <[ksocknal]ksocknal_scheduler+199/640>
> Trace; c01075c3 <arch_kernel_thread+23/30>
> Trace; f8b75470 <[ksocknal]ksocknal_scheduler+0/640>
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/



Martin Vogt

2006-May-19 07:36 UTC


[Lustre-discuss] question about __alloc_pages: 0-order allocation failed (gfp=0x20/0)

Capps@iozone.org wrote:
>Martin,
>
>    Are you running Iozone on an OST ?
>
>Enjoy,
>Don Capps
>
>  
>
Hello Don,

No. These are separate machines.
The error is new btw. Before that (1.4b) I could run 350 iterations
without this bug. (Then I stopped the benchmark; 350 iterations
are close to infinity :-)

Clients on an OST do not work, I think I read this somewhere.

regards,

Martin

>----- Original Message ----- 
>From: "Martin Vogt" <vogt@itwm.fraunhofer.de>
>To: <lustre-discuss@lists.clusterfs.com>
>Sent: Friday, October 08, 2004 4:57 AM
>Subject: [Lustre-discuss] question about __alloc_pages: 0-order allocation
>failed (gfp=0x20/0)
>
>
>  
>
>>Hello,
>>
>>
>>when I run IOZone in an endless loop I have one client which
>>after 10 Iterations stops working.
>>
>>I get this error in dmesg:
>>
>>
>>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>__alloc_pages: 0-order allocation failed (gfp=0x20/0)
>>
>>
>>All the clients are the same (RAM/CPU, etc...)
>>After the next iteration the iozone binary is started, lustre re-imports
>>its OSTs on the client with this error message and the binary runs.
>>
>>regards,
>>
>>Martin
>>
>>_______________________________________________
>>Lustre-discuss mailing list
>>Lustre-discuss@lists.clusterfs.com
>>https://lists.clusterfs.com/mailman/listinfo/lustre-discuss
>>
>>    
>>
>
>  
>

Martin Vogt

2006-May-19 07:36 UTC


[Lustre-discuss] Re: question about __alloc_pages: 0-order allocation failed (gfp=0x20/0)

Andreas Dilger wrote:
>On Oct 12, 2004  12:17 +0200, Martin Vogt wrote:
>
>>I piped the trace through ksymoops; the trace is at the end of the mail.
>>
>>But the oops was not Lustre's fault (I think).
>>
>>The trace shows that there are vmware modules loaded.
>>I had tested vmware on this machine, then reinstalled 1.3.2
>>without rebuilding the vmware kernel modules, and thus
>>was using the old ones. I think this leads to the oops.
>>
>>After removing vmnet/vmmon, iozone runs again and there are no oopses
>>(15 iterations).
>
>Just FYI, it isn't really an "oops", just a stack dump for the currently
>running process when there is an allocation failure (this is what the
>vm.vm_gfp_debug=1 setting does for us).  It tells us that the allocation
>failure is coming from alloc_skb() in the e1000 receive path, though
>I'm not sure why that is happening.  Do all of the stack traces look
>similar?  This is sort of "statistical" debugging, since any allocation
>could be the one that fails, but if you examine enough of them it is
>likely that the one that happens the most would become clear.
>
>It is interesting to hear that removing the vmware modules avoids
>this problem.  Does vmware do anything with incoming network traffic
>(e.g. filter it for forwarding to the virtual machine)?

I don't know, but I saw that the vmware modules were not necessary,
and after removing them I had no stack dumps/traces.
Before that (1.4b) the vmware modules were properly compiled against the
lustre 1.4b kernel and I had no stack dumps.
Now I'm running lustre-1.3.3 and I have no stack dumps (but still the
error below).

>>But I have still these errors:
>>
>>>LustreError: 1661:0:(client.c:816:ptlrpc_expire_one_request()) @@@
>>>timeout (sent at 1097573837) req@e7692800 >x910748/t657940
>>>o4->media-ost3_UUID@NID_192.168.9.13_UUID:6 lens 288/248 ref 3 fl
>>>?phase?:R/4/0 rc 0/0
>>
>>After some time the upcall script is executed and the client continues.
>>Is it ok that I have these error messages on a regular basis?
>
>It's not really OK.  It means that your client is timing out in its
>RPCs to the OST, possibly because these allocation failures are
>causing the OST to drop incoming messages.

No, this error comes from a run where I have no allocation failures or
stack dumps.
This error comes regularly "under network load" (e.g. running iozone in an
endless loop).
These are the module parameters for the NIC. (Otherwise I made no changes.)

 >options e1000 RxIntDelay=0,0,0 RxDescriptors=1024,1024,1024 TxDescriptors=1024,1024,1024 TxIntDelay=0,0,0

I have had the "ptlrpc_expire_one_request" error from the beginning of
Lustre testing:

- on 2.6.x SLES9 kernels (lustre 2.4.x)
- on redhat EL kernels, pre-built and self-built.
- on SuSE 2.4.21 kernels (from lnxi)
- on my patched SuSE kernels (2.4.24)

Although the combination gcc 3.3.3 + linux-2.4.24 works best.
The error is not "fatal"; usually the client hangs for some time and then
it "recovers".

On the redhat kernel (from clusterfs) version 1.2.4 I had these errors:
After 59 iterations (6 hours of testing) of a parallel iozone I get this
LustreError:

 >Aug 27 05:24:07 media4 kernel: LustreError: 1765:0:(socknal_cb.c:1058:ksocknal_sendmsg()) PORTALS: out of memory at socknal_cb.c:1058 (tried to alloc 'ltx' = 136)
 >Aug 27 05:24:07 media4 kernel: LustreError: 1765:0:(socknal_cb.c:1058:ksocknal_sendmsg()) PORTALS: out of memory at socknal_cb.c:1058 (tried to alloc 'ltx' = 136)

This error was with the newest Intel drivers. (The drivers from RH 3.0
don't know the PCI ID of the NICs, so I cannot use them.)

Maybe it is a bug in the e1000 drivers?
When I had the allocation failure I made 4 dumps with ksymoops.
Here is another one (in this one there is no vmware module included).
All of the dumps looked similar, except the part in the "middle":
sometimes it was vmware; here it is sunrpc.

regards,

Martin

Trace; c013aa22 <__alloc_pages+272/280>
Trace; c013aa57 <__get_free_pages+27/30>
Trace; c0137896 <kmem_cache_grow+c6/280>
Trace; c0137ddc <kmalloc+8c/160>
Trace; c0227e4a <alloc_skb+ba/1f0>
Trace; f8a14112 <[e1000]e1000_alloc_rx_buffers+a2/120>
Trace; f8a13d89 <[e1000]e1000_clean_rx_irq+169/450>
Trace; f8a14112 <[e1000]e1000_alloc_rx_buffers+a2/120>
Trace; f8a13d89 <[e1000]e1000_clean_rx_irq+169/450>
Trace; f8a139c8 <[e1000]e1000_intr+28/80>
Trace; c010aad3 <handle_IRQ_event+63/a0>
Trace; c010acf5 <do_IRQ+a5/100>
Trace; c010d398 <call_do_IRQ+5/d>
Trace; c02280c4 <kfree_skbmem+4/70>
Trace; c0228230 <__kfree_skb+100/150>
Trace; f8a9163a <[sunrpc]xdr_partial_copy_from_skb+ea/200>
Trace; f8a90e65 <[sunrpc].text.lock.pmap_clnt+12/3d>
Trace; c0228230 <__kfree_skb+100/150>
Trace; f8a93c73 <[sunrpc]__kstrtab_nlm_debug+fc3/2c50>
Trace; c022c20e <dev_queue_xmit_nit+9e/f0>
Trace; c0237d62 <qdisc_restart+122/1a0>
Trace; c022c45d <dev_queue_xmit+16d/320>
Trace; c0246821 <ip_output+121/1c0>
Trace; c0246ac4 <ip_queue_xmit+204/570>
Trace; c01379cf <kmem_cache_grow+1ff/280>
Trace; c025e8c2 <tcp_v4_send_check+82/d0>
Trace; c0258ca7 <tcp_transmit_skb+3a7/5f0>
Trace; c025b4d0 <tcp_send_ack+80/c0>
Trace; c0257630 <tcp_rcv_established+720/850>
Trace; c010ad42 <do_IRQ+f2/100>
Trace; c025f9cf <tcp_v4_do_rcv+11f/130>
Trace; c0260038 <tcp_v4_rcv+658/720>
Trace; c0243612 <ip_local_deliver+1a2/1d0>
Trace; c0243a32 <ip_rcv+3f2/440>
Trace; c022cc45 <netif_receive_skb+1c5/1f0>
Trace; c022cd1d <process_backlog+ad/160>
Trace; c022cec1 <net_rx_action+f1/180>
Trace; c01232f7 <do_softirq+87/f0>
Trace; c010ad42 <do_IRQ+f2/100>
Trace; c010d398 <call_do_IRQ+5/d>
Trace; c026ca40 <inet_recvmsg+0/50>
Trace; c022499e <sock_recvmsg+2e/c0>
Trace; f8b71a1a <[obdclass]llog_print_cb+41a/6e0>
Trace; f8b71c2a <[obdclass]llog_print_cb+62a/6e0>

Here is another one, but this was not resolved against a proper System.map:

Trace; c013aa22 <__alloc_pages+272/280>
Trace; c013aa57 <__get_free_pages+27/30>
Trace; c0137896 <kmem_cache_grow+c6/280>
Trace; c0137ddc <kmalloc+8c/160>
Trace; c0227e4a <alloc_skb+ba/1f0>
Trace; f8a14112 <[e1000]e1000_alloc_rx_buffers+a2/120>
Trace; f8a13d89 <[e1000]e1000_clean_rx_irq+169/450>
Trace; c010ad42 <do_IRQ+f2/100>
Trace; f8a139c8 <[e1000]e1000_intr+28/80>
Trace; c010aad3 <handle_IRQ_event+63/a0>
Trace; c010acf5 <do_IRQ+a5/100>
Trace; c010d398 <call_do_IRQ+5/d>
Trace; c0220018 <md_ioctl+478/7d0>
Trace; c027cb52 <packet_rcv_spkt+82/220>
Trace; c022c20e <dev_queue_xmit_nit+9e/f0>
Trace; c0237d62 <qdisc_restart+122/1a0>
Trace; c022c45d <dev_queue_xmit+16d/320>
Trace; c0246821 <ip_output+121/1c0>
Trace; c0246ac4 <ip_queue_xmit+204/570>
Trace; c01379cf <kmem_cache_grow+1ff/280>
Trace; c025e8c2 <tcp_v4_send_check+82/d0>
Trace; c0258ca7 <tcp_transmit_skb+3a7/5f0>
Trace; c025b4d0 <tcp_send_ack+80/c0>
Trace; c0257630 <tcp_rcv_established+720/850>
Trace; c010ad42 <do_IRQ+f2/100>
Trace; c025f9cf <tcp_v4_do_rcv+11f/130>
Trace; c0260038 <tcp_v4_rcv+658/720>

Kian_Chang_Low@veritasdgc.com

2006-May-19 07:36 UTC


[Lustre-discuss] Lustre with Infiniband

Hi,

Is there a version of Lustre that is IB (infiniband) ready? And is that 
native IB NAL (network abstraction layer), or sockets NAL over IPoIB or 
SDP (socket direct protocol)?

Thanks,
Kian Chang.

Phil Schwan

2006-May-19 07:36 UTC


[Lustre-discuss] Lustre with Infiniband

Hi Kian--

On 11/3/2004 3:35, Kian_Chang_Low@veritasdgc.com wrote:
> Is there a version of Lustre that is IB (infiniband) ready? And is that
> native IB NAL (network abstraction layer), or sockets NAL over IPoIB or
> SDP (socket direct protocol)?

Lustre 1.4.0 includes two NALs: one for the Mellanox/TopSpin/OpenIB-gen-1
stack, and one for the Infinicon stack.  Both are "beta" quality, not ready
for production use, and not updated to the latest version of the IB stack.
Both NALs use verbs, not SDP or IPoIB.

You can use IP over IB with the normal socknal.  Lustre 1.4.1 will ship by
default with a setting that improves IPoIB performance.  It causes the
socknal to issue I/O to the TCP/IP stack differently.

Thanks--

-Phil

Martin Vogt

2006-May-19 07:36 UTC


[Lustre-discuss] question about __alloc_pages: 0-order allocation failed (gfp=0x20/0)

Hello,


when I run IOZone in an endless loop I have one client which
after 10 Iterations stops working.

I get this error in dmesg:


__alloc_pages: 0-order allocation failed (gfp=0x20/0)
__alloc_pages: 0-order allocation failed (gfp=0x20/0)
__alloc_pages: 0-order allocation failed (gfp=0x20/0)
__alloc_pages: 0-order allocation failed (gfp=0x20/0)
__alloc_pages: 0-order allocation failed (gfp=0x20/0)


All the clients are the same (RAM/CPU, etc...)
After the next iteration the iozone binary is started, lustre re-imports
its OSTs on the client with this error message and the binary runs.

regards,

Martin
