Lu Wang
2009-Mar-02 08:10 UTC
[Lustre-discuss] Process accessing Lustre be killed on Lustre client
Dear list,

One of our applications, which generates large files (about 500 MB per file), keeps getting killed on the Lustre client. Here is the log:

Feb 2 11:55:17 bws0044 kernel: Free pages: 1449604kB (1433152kB HighMem)
Feb 2 11:55:17 bws0044 kernel: Active:1348397 inactive:2249899 dirty:3044 writeback:445 unstable:0 free:362401 slab:185354 mapped:523156 pagetables:3413
Feb 2 11:55:17 bws0044 kernel: DMA free:12532kB min:72kB low:144kB high:216kB active:0kB inactive:0kB present:16384kB pages_scanned:341 all_unreclaimable? yes
Feb 2 11:55:17 bws0044 kernel: protections[]: 0 0 0
Feb 2 11:55:17 bws0044 kernel: Normal free:3920kB min:4020kB low:8040kB high:12060kB active:312kB inactive:96kB present:901120kB pages_scanned:6839 all_unreclaimable? yes
Feb 2 11:55:17 bws0044 kernel: protections[]: 0 0 0
Feb 2 11:55:17 bws0044 kernel: HighMem free:1433152kB min:512kB low:1024kB high:1536kB active:5393276kB inactive:8999500kB present:16646144kB pages_scanned:0 all_unreclaimable? no
Feb 2 11:55:17 bws0044 kernel: protections[]: 0 0 0
Feb 2 11:55:17 bws0044 kernel: DMA: 3*4kB 3*8kB 3*16kB 3*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12532kB
Feb 2 11:55:17 bws0044 kernel: Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3920kB
Feb 2 11:55:17 bws0044 kernel: HighMem: 19608*4kB 2564*8kB 476*16kB 230*32kB 2363*64kB 223*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 278*4096kB = 1433152kB
Feb 2 11:55:17 bws0044 kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Feb 2 11:55:17 bws0044 kernel: 0 bounce buffer pages
Feb 2 11:55:17 bws0044 kernel: Free swap: 8193140kB
Feb 2 11:55:17 bws0044 kernel: 4390912 pages of RAM
Feb 2 11:55:17 bws0044 kernel: 3964860 pages of HIGHMEM
Feb 2 11:55:17 bws0044 kernel: 232558 reserved pages
Feb 2 11:55:17 bws0044 kernel: 3103013 pages shared
Feb 2 11:55:17 bws0044 kernel: 0 pages swap cached
Feb 2 11:55:17 bws0044 kernel: Out of Memory: Killed process 17313 (boss.exe).
Feb 2 11:55:17 bws0044 kernel: oom-killer: gfp_mask=0xd0

My questions are:
1. Does the Lustre client require a lot of low memory?
2. For our application, do I need to upgrade the clients to 64-bit?

------------------------------------------------------------------------
Lu WANG
Computing Center, IHEP
Tel: (+86) 10 8823 6012 ext 607
P.O. Box 918-7
Beijing 100049, China
Email: wanglu at ihep.ac.cn
Johann Lombardi
2009-Mar-02 10:07 UTC
[Lustre-discuss] Process accessing Lustre be killed on Lustre client
On Mon, Mar 02, 2009 at 04:10:46PM +0800, Lu Wang wrote:
> My questions are:
> 1. Does the Lustre client require a lot of low memory?

There is one known issue with the LRU resize feature on i686 (it can consume almost all the low memory). To find out whether this is the same problem, could you please disable LRU resize on the client side and see if you hit this bug again? To do so, run the following commands on the client(s):

lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))
lctl set_param ldlm.namespaces.*mdc*.lru_size=$((NR_CPU*100))

where NR_CPU is the number of CPUs on the client.

Cheers,
Johann
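A minimal sketch of the workaround above, assuming a Linux client where the CPU count can be read from /proc/cpuinfo. The script only prints the lctl commands so they can be reviewed before being run as root:

```shell
#!/bin/sh
# Compute the per-client lock cap Johann suggests: 100 LDLM locks per CPU.
NR_CPU=$(grep -c ^processor /proc/cpuinfo)
LRU_SIZE=$((NR_CPU * 100))

# Print (rather than execute) the commands that disable LRU auto-resize
# by pinning the OSC and MDC lock LRUs to a fixed size.
echo "lctl set_param ldlm.namespaces.*osc*.lru_size=$LRU_SIZE"
echo "lctl set_param ldlm.namespaces.*mdc*.lru_size=$LRU_SIZE"
```

Piping the output to sh (as root) on each client applies the settings; they do not persist across remounts, so they would need to be reapplied from a boot script.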
Lu Wang
2009-Mar-03 02:18 UTC
[Lustre-discuss] Process accessing Lustre be killed on Lustre client
# lctl get_param ldlm.namespaces.*osc*.lru_size
ldlm.namespaces.besfs-OST0000-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST0001-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST0002-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST0003-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST0004-osc-f7dfe400.lru_size=1
ldlm.namespaces.besfs-OST0005-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST0006-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST0007-osc-f7dfe400.lru_size=1
ldlm.namespaces.besfs-OST0008-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST0009-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST000a-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST000b-osc-f7dfe400.lru_size=0
ldlm.namespaces.besfs-OST000c-osc-f7dfe400.lru_size=0
....

I got "0" for lru_size; according to the Lustre manual, this means "automatic resizing". Is the memory pressure caused by an uncontrolled LRU size?

----------------
Lu Wang
2009-03-03
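One way to check this: lru_size=0 does mean auto-resize, and the sibling lock_count parameter reports how many locks each namespace actually holds, so summing it shows how large the auto-resized LRUs have grown. The pipeline below is demonstrated on a captured sample (the lock counts are invented for illustration) standing in for live "lctl get_param ldlm.namespaces.*.lock_count" output:

```shell
#!/bin/sh
# Sample standing in for real "lctl get_param ldlm.namespaces.*.lock_count"
# output; the numbers here are made up for illustration.
SAMPLE='ldlm.namespaces.besfs-OST0000-osc-f7dfe400.lock_count=5210
ldlm.namespaces.besfs-OST0001-osc-f7dfe400.lock_count=4987'

# Sum the value after "=" on each line to get the total lock count.
TOTAL=$(echo "$SAMPLE" | awk -F= '{sum += $2} END {print sum}')
echo "total locks held: $TOTAL"
```

A steadily growing total under load would point at the LRU as the consumer of low memory.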
Lu Wang
2009-Mar-03 03:32 UTC
[Lustre-discuss] Process accessing Lustre be killed on Lustre client
Dear list,

When I sent test jobs (dd of 5 GB files, 8 jobs per node) to these nodes, I got errors like:

Mar 3 11:14:48 bws0091 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.66 at tcp. The obd_ping operation failed with -107
Mar 3 11:14:48 bws0091 kernel: LustreError: Skipped 69 previous similar messages
Mar 3 11:14:48 bws0091 kernel: LustreError: 167-0: This client was evicted by besfs-OST0010; in progress operations using this service will fail.
Mar 3 11:15:51 bws0091 kernel: LustreError: 4959:0:(lib-move.c:95:lnet_try_match_md()) Matching packet from 12345-192.168.50.32 at tcp, match 4016570 length 1408 too big: 1008 left, 1008 allowed
Mar 3 11:27:17 bws0091 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.66 at tcp. The obd_ping operation failed with -107
Mar 3 11:27:17 bws0091 kernel: LustreError: Skipped 66 previous similar messages
Mar 3 11:27:17 bws0091 kernel: LustreError: 167-0: This client was evicted by besfs-OST0010; in progress operations using this service will fail.

------------------
Lu Wang
2009-03-03

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Lu Wang
2009-Mar-03 09:38 UTC
[Lustre-discuss] Process accessing Lustre be killed on Lustre client
I solved this problem by reducing "/proc/fs/lustre/llite/*/max_cached_mb".

------------------
Lu Wang
2009-03-03
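A sketch of that fix, for anyone hitting the same OOM: max_cached_mb caps the page cache the Lustre client may keep. The quarter-of-RAM value below is an illustrative assumption, not a figure from this thread, and the script prints the tuning commands instead of writing to /proc directly:

```shell
#!/bin/sh
# Derive an illustrative cap: a quarter of total RAM (pick your own value).
# Falls back to the RAM size from the log above if /proc/meminfo is absent.
MEM_TOTAL_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
[ -n "$MEM_TOTAL_KB" ] || MEM_TOTAL_KB=17563648  # 4390912 pages x 4kB
CAP_MB=$((MEM_TOTAL_KB / 1024 / 4))

# Print the commands: inspect the current limit, then lower it.  Writing
# the value into /proc/fs/lustre/llite/*/max_cached_mb also works.
echo "lctl get_param llite.*.max_cached_mb"
echo "lctl set_param llite.*.max_cached_mb=$CAP_MB"
```

On a 32-bit client the useful cap is driven by low-memory pressure rather than total RAM, so a smaller value than this sketch suggests may be needed.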