On machines running many HVM (stubdom-based) domains, I often see errors like this:

[77176.524094] qemu-dm invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
[77176.524102] Pid: 7478, comm: qemu-dm Not tainted 2.6.32.25-g80f7e08 #2
[77176.524109] Call Trace:
[77176.524123]  [<ffffffff810897fd>] ? T.413+0xcd/0x290
[77176.524129]  [<ffffffff81089ad3>] ? __out_of_memory+0x113/0x180
[77176.524133]  [<ffffffff81089b9e>] ? out_of_memory+0x5e/0xc0
[77176.524140]  [<ffffffff8108d1cb>] ? __alloc_pages_nodemask+0x69b/0x6b0
[77176.524144]  [<ffffffff8108d1f2>] ? __get_free_pages+0x12/0x60
[77176.524152]  [<ffffffff810c94e7>] ? __pollwait+0xb7/0x110
[77176.524161]  [<ffffffff81262b93>] ? n_tty_poll+0x183/0x1d0
[77176.524165]  [<ffffffff8125ea42>] ? tty_poll+0x92/0xa0
[77176.524169]  [<ffffffff810c8a92>] ? do_select+0x362/0x670
[77176.524173]  [<ffffffff810c9430>] ? __pollwait+0x0/0x110
[77176.524178]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524183]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524188]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524193]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524197]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524202]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524207]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524212]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524217]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524222]  [<ffffffff810c8fb5>] ? core_sys_select+0x215/0x350
[77176.524231]  [<ffffffff810100af>] ? xen_restore_fl_direct_end+0x0/0x1
[77176.524236]  [<ffffffff8100c48d>] ? xen_mc_flush+0x8d/0x1b0
[77176.524243]  [<ffffffff81014ffb>] ? xen_hypervisor_callback+0x1b/0x20
[77176.524251]  [<ffffffff814b0f5a>] ? error_exit+0x2a/0x60
[77176.524255]  [<ffffffff8101485d>] ? retint_restore_args+0x5/0x6
[77176.524263]  [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
[77176.524268]  [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
[77176.524276]  [<ffffffff810663d1>] ? ktime_get_ts+0x61/0xd0
[77176.524281]  [<ffffffff810c9354>] ? sys_select+0x44/0x120
[77176.524286]  [<ffffffff81013f02>] ? system_call_fastpath+0x16/0x1b
[77176.524290] Mem-Info:
[77176.524293] DMA per-cpu:
[77176.524296] CPU    0: hi:    0, btch:   1 usd:   0
[77176.524300] CPU    1: hi:    0, btch:   1 usd:   0
[77176.524303] CPU    2: hi:    0, btch:   1 usd:   0
[77176.524306] CPU    3: hi:    0, btch:   1 usd:   0
[77176.524310] CPU    4: hi:    0, btch:   1 usd:   0
[77176.524313] CPU    5: hi:    0, btch:   1 usd:   0
[77176.524316] CPU    6: hi:    0, btch:   1 usd:   0
[77176.524318] CPU    7: hi:    0, btch:   1 usd:   0
[77176.524322] CPU    8: hi:    0, btch:   1 usd:   0
[77176.524324] CPU    9: hi:    0, btch:   1 usd:   0
[77176.524327] CPU   10: hi:    0, btch:   1 usd:   0
[77176.524330] CPU   11: hi:    0, btch:   1 usd:   0
[77176.524333] CPU   12: hi:    0, btch:   1 usd:   0
[77176.524336] CPU   13: hi:    0, btch:   1 usd:   0
[77176.524339] CPU   14: hi:    0, btch:   1 usd:   0
[77176.524342] CPU   15: hi:    0, btch:   1 usd:   0
[77176.524345] CPU   16: hi:    0, btch:   1 usd:   0
[77176.524348] CPU   17: hi:    0, btch:   1 usd:   0
[77176.524351] CPU   18: hi:    0, btch:   1 usd:   0
[77176.524354] CPU   19: hi:    0, btch:   1 usd:   0
[77176.524358] CPU   20: hi:    0, btch:   1 usd:   0
[77176.524364] CPU   21: hi:    0, btch:   1 usd:   0
[77176.524367] CPU   22: hi:    0, btch:   1 usd:   0
[77176.524370] CPU   23: hi:    0, btch:   1 usd:   0
[77176.524372] DMA32 per-cpu:
[77176.524374] CPU    0: hi:  186, btch:  31 usd:  81
[77176.524377] CPU    1: hi:  186, btch:  31 usd:  66
[77176.524380] CPU    2: hi:  186, btch:  31 usd:  49
[77176.524385] CPU    3: hi:  186, btch:  31 usd:  67
[77176.524387] CPU    4: hi:  186, btch:  31 usd:  93
[77176.524390] CPU    5: hi:  186, btch:  31 usd:  73
[77176.524393] CPU    6: hi:  186, btch:  31 usd:  50
[77176.524396] CPU    7: hi:  186, btch:  31 usd:  79
[77176.524399] CPU    8: hi:  186, btch:  31 usd:  21
[77176.524402] CPU    9: hi:  186, btch:  31 usd:  38
[77176.524406] CPU   10: hi:  186, btch:  31 usd:   0
[77176.524409] CPU   11: hi:  186, btch:  31 usd:  75
[77176.524412] CPU   12: hi:  186, btch:  31 usd:   1
[77176.524414] CPU   13: hi:  186, btch:  31 usd:   4
[77176.524417] CPU   14: hi:  186, btch:  31 usd:   9
[77176.524420] CPU   15: hi:  186, btch:  31 usd:   0
[77176.524423] CPU   16: hi:  186, btch:  31 usd:  56
[77176.524426] CPU   17: hi:  186, btch:  31 usd:  35
[77176.524429] CPU   18: hi:  186, btch:  31 usd:  32
[77176.524432] CPU   19: hi:  186, btch:  31 usd:  39
[77176.524435] CPU   20: hi:  186, btch:  31 usd:  24
[77176.524438] CPU   21: hi:  186, btch:  31 usd:   0
[77176.524441] CPU   22: hi:  186, btch:  31 usd:  35
[77176.524444] CPU   23: hi:  186, btch:  31 usd:  51
[77176.524447] Normal per-cpu:
[77176.524449] CPU    0: hi:  186, btch:  31 usd:  29
[77176.524453] CPU    1: hi:  186, btch:  31 usd:   1
[77176.524456] CPU    2: hi:  186, btch:  31 usd:  30
[77176.524459] CPU    3: hi:  186, btch:  31 usd:  30
[77176.524463] CPU    4: hi:  186, btch:  31 usd:  30
[77176.524466] CPU    5: hi:  186, btch:  31 usd:  31
[77176.524469] CPU    6: hi:  186, btch:  31 usd:   0
[77176.524471] CPU    7: hi:  186, btch:  31 usd:   0
[77176.524474] CPU    8: hi:  186, btch:  31 usd:  30
[77176.524477] CPU    9: hi:  186, btch:  31 usd:  28
[77176.524480] CPU   10: hi:  186, btch:  31 usd:   0
[77176.524483] CPU   11: hi:  186, btch:  31 usd:  30
[77176.524486] CPU   12: hi:  186, btch:  31 usd:   0
[77176.524489] CPU   13: hi:  186, btch:  31 usd:   0
[77176.524492] CPU   14: hi:  186, btch:  31 usd:   0
[77176.524495] CPU   15: hi:  186, btch:  31 usd:   0
[77176.524498] CPU   16: hi:  186, btch:  31 usd:   0
[77176.524501] CPU   17: hi:  186, btch:  31 usd:   0
[77176.524504] CPU   18: hi:  186, btch:  31 usd:   0
[77176.524507] CPU   19: hi:  186, btch:  31 usd:   0
[77176.524510] CPU   20: hi:  186, btch:  31 usd:   0
[77176.524513] CPU   21: hi:  186, btch:  31 usd:   0
[77176.524516] CPU   22: hi:  186, btch:  31 usd:   0
[77176.524518] CPU   23: hi:  186, btch:  31 usd:   0
[77176.524524] active_anon:5675 inactive_anon:4676 isolated_anon:0
[77176.524526]  active_file:146373 inactive_file:153543 isolated_file:480
[77176.524527]  unevictable:0 dirty:167539 writeback:322 unstable:0
[77176.524528]  free:5017 slab_reclaimable:15640 slab_unreclaimable:8972
[77176.524529]  mapped:1114 shmem:7 pagetables:1908 bounce:0
[77176.524536] DMA free:9820kB min:32kB low:40kB high:48kB active_anon:4kB inactive_anon:0kB active_file:616kB inactive_file:2212kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12740kB mlocked:0kB dirty:2292kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:72kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:12kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:3040 all_unreclaimable? no
[77176.524541] lowmem_reserve[]: 0 1428 2452 2452
[77176.524551] DMA32 free:7768kB min:3680kB low:4600kB high:5520kB active_anon:22696kB inactive_anon:18704kB active_file:584580kB inactive_file:608508kB unevictable:0kB isolated(anon):0kB isolated(file):1920kB present:1462496kB mlocked:0kB dirty:664128kB writeback:1276kB mapped:4456kB shmem:28kB slab_reclaimable:62076kB slab_unreclaimable:32292kB kernel_stack:5120kB pagetables:7620kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1971808 all_unreclaimable? yes
[77176.524556] lowmem_reserve[]: 0 0 1024 1024
[77176.524564] Normal free:2480kB min:2636kB low:3292kB high:3952kB active_anon:0kB inactive_anon:0kB active_file:296kB inactive_file:3452kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1048700kB mlocked:0kB dirty:3736kB writeback:12kB mapped:0kB shmem:0kB slab_reclaimable:412kB slab_unreclaimable:3488kB kernel_stack:80kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:8192 all_unreclaimable? yes
[77176.524569] lowmem_reserve[]: 0 0 0 0
[77176.524574] DMA: 4*4kB 25*8kB 11*16kB 7*32kB 8*64kB 8*128kB 8*256kB 3*512kB 0*1024kB 0*2048kB 1*4096kB = 9832kB
[77176.524587] DMA32: 742*4kB 118*8kB 3*16kB 3*32kB 2*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7768kB
[77176.524600] Normal: 1*4kB 1*8kB 2*16kB 13*32kB 14*64kB 2*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1612kB
[77176.524613] 302308 total pagecache pages
[77176.524615] 1619 pages in swap cache
[77176.524617] Swap cache stats: add 40686, delete 39067, find 24687/26036
[77176.524619] Free swap  = 10141956kB
[77176.524621] Total swap = 10239992kB
[77176.577607] 793456 pages RAM
[77176.577611] 436254 pages reserved
[77176.577613] 308627 pages shared
[77176.577615] 49249 pages non-shared
[77176.577620] Out of memory: kill process 5755 (python2.6) score 110492 or a child
[77176.577623] Killed process 5757 (python2.6)

Depending on what gets nuked by the OOM-killer, I am frequently left with an unusable system that needs to be rebooted.

The machine always has plenty of memory available (1.5 GB devoted to dom0, of which >1 GB is always just in "cached" state). For instance, right now, on this same machine:

# free
             total       used       free     shared    buffers     cached
Mem:       1536512    1493112      43400          0      10284    1144904
-/+ buffers/cache:     337924    1198588
Swap:     10239992      74444   10165548

I have seen this OOM problem on a wide range of Xen versions, stretching as far back as I can remember, including the most recent 4.1-unstable and 2.6.32 pvops kernel (from yesterday, tested in the hope that they would fix this).
I haven't found a way to reliably reproduce it yet, but I suspect that the problem relates to reasonably heavy disk or network activity -- during this last one, I see that a domain was briefly doing ~200 Mbps of downloads.

Anyone have any ideas on what this could be? Is RAM getting spontaneously filled because a buffer somewhere grows too quickly, or something like that? What can I try here?

-John
> On machines running many HVM (stubdom-based) domains, I often see errors
> like this:
>
> [77176.524094] qemu-dm invoked oom-killer: gfp_mask=0xd0, order=0,
> oom_adj=0

What do the guests use for storage? (e.g. "blktap2 for VHD files on an iscsi mounted ext3 volume")

It might be worth looking at /proc/slabinfo to see if there's anything suspicious.

BTW: 24 vCPUs in dom0 seems excessive, especially if you're using stubdoms. You may get better performance by dropping that to e.g. 2 or 3.

Ian
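For reference, the vCPU cap is normally applied on the hypervisor command line, and the slab data can be read directly from dom0. A minimal sketch, with illustrative bootloader syntax and values that are assumptions rather than anything from this thread:

    # /boot/grub/menu.lst (illustrative): cap dom0 at 2 vCPUs and pin its memory
    #   kernel /xen.gz dom0_max_vcpus=2 dom0_mem=1536M
    # From dom0, watch slab usage for anything that grows without bound:
    slabtop -s c            # caches sorted by total size
    cat /proc/slabinfo      # raw per-cache counters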
> What do the guests use for storage? (e.g. "blktap2 for VHD files on
> an iscsi mounted ext3 volume")

Simple sparse .img files on a local ext4 RAID volume, using "file:".

> It might be worth looking at /proc/slabinfo to see if there's anything
> suspicious.

I didn't see anything suspicious in there, but I'm not sure what I'm looking for.

Here is the first page of slabtop as it currently stands, if that helps. It's a bit easier to read.

 Active / Total Objects (% used)    : 274753 / 507903 (54.1%)
 Active / Total Slabs (% used)      : 27573 / 27582 (100.0%)
 Active / Total Caches (% used)     : 85 / 160 (53.1%)
 Active / Total Size (% used)       : 75385.52K / 107127.41K (70.4%)
 Minimum / Average / Maximum Object : 0.02K / 0.21K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
306397 110621  36%    0.10K   8281       37     33124K buffer_head
 37324  26606  71%    0.54K   5332        7     21328K radix_tree_node
 25640  25517  99%    0.19K   1282       20      5128K size-192
 23472  23155  98%    0.08K    489       48      1956K sysfs_dir_cache
 19964  19186  96%    0.95K   4991        4     19964K ext4_inode_cache
 17860  13026  72%    0.19K    893       20      3572K dentry
 14896  13057  87%    0.03K    133      112       532K size-32
  8316   6171  74%    0.17K    378       22      1512K vm_area_struct
  8142   5053  62%    0.06K    138       59       552K size-64
  4320   3389  78%    0.12K    144       30       576K size-128
  3760   2226  59%    0.19K    188       20       752K filp
  3456   1875  54%    0.02K     24      144        96K anon_vma
  3380   3001  88%    1.00K    845        4      3380K size-1024
  3380   3365  99%    0.76K    676        5      2704K shmem_inode_cache
  2736   2484  90%    0.50K    342        8      1368K size-512
  2597   2507  96%    0.07K     49       53       196K Acpi-Operand
  2100   1095  52%    0.25K    140       15       560K skbuff_head_cache
  1920    819  42%    0.12K     64       30       256K cred_jar
  1361   1356  99%    4.00K   1361        1      5444K size-4096
  1230    628  51%    0.12K     41       30       164K pid
  1008    907  89%    0.03K      9      112        36K Acpi-Namespace
   959    496  51%    0.57K    137        7       548K inode_cache
   891    554  62%    0.81K     99        9       792K signal_cache
   888    115  12%    0.10K     24       37        96K ext4_prealloc_space
   885    122  13%    0.06K     15       59        60K fs_cache
   850    642  75%    1.45K    170        5      1360K task_struct
   820    769  93%    0.19K     41       20       164K bio-0
   666    550  82%    2.06K    222        3      1776K sighand_cache
   576    211  36%    0.50K     72        8       288K task_xstate
   529    379  71%    0.16K     23       23        92K cfq_queue
   518    472  91%    2.00K    259        2      1036K size-2048
   506    375  74%    0.16K     22       23        88K cfq_io_context
   495    353  71%    0.33K     45       11       180K blkdev_requests
   465    422  90%    0.25K     31       15       124K size-256
   418    123  29%    0.69K     38       11       304K files_cache
   360    207  57%    0.69K     72        5       288K sock_inode_cache
   360    251  69%    0.12K     12       30        48K scsi_sense_cache
   336    115  34%    0.08K      7       48        28K blkdev_ioc
   285    236  82%    0.25K     19       15        76K scsi_cmd_cache

> BTW: 24 vCPUs in dom0 seems excessive, especially if you're using
> stubdoms. You may get better performance by dropping that to e.g. 2 or 3.

I will test that. Do you think it will make a difference in this case?

-John
> > What do the guests use for storage? (e.g. "blktap2 for VHD files on
> > an iscsi mounted ext3 volume")
>
> Simple sparse .img files on a local ext4 RAID volume, using "file:".

Ah, if you're using loop it may be that you're just filling memory with dirty pages. Older kernels certainly did this, not sure about newer ones.

I'd be inclined to use blktap2 in raw file mode, with "aio:".

Ian
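In config terms, the change being suggested amounts to swapping the disk line in the guest's config file. A minimal sketch, assuming an xm-style config and a made-up image path:

    # before: loop-backed file, writes go through the dom0 page cache
    disk = [ 'file:/srv/xen/guest1.img,hda,w' ]
    # after: blktap2 in raw file mode, tapdisk opens the image with O_DIRECT
    disk = [ 'tap2:tapdisk:aio:/srv/xen/guest1.img,hda,w' ]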
On 11/13/2010 1:13 AM, Ian Pratt wrote:
> Ah, if you're using loop it may be that you're just filling memory
> with dirty pages. Older kernels certainly did this, not sure about
> newer ones.
>
> I'd be inclined to use blktap2 in raw file mode, with "aio:".

That makes sense. tap/tap2 didn't work for me in prior releases, so I had to stick to file. It seems to work now (well, tap2:tapdisk:aio does; tap:tapdisk:aio still doesn't), so I'll switch everything over to it and cross my fingers.

-John
On 11/13/2010 1:13 AM, Ian Pratt wrote:
> Ah, if you're using loop it may be that you're just filling memory with
> dirty pages. Older kernels certainly did this, not sure about newer ones.
>
> I'd be inclined to use blktap2 in raw file mode, with "aio:".

With blktap2, is free RAM in dom0 still used for a disk cache at all? I have this dom0 set to 1.5 GB mainly to help with caching; if that RAM is not needed, I'll retool it down to a smaller number.

Thanks,
John
This kind of bug seems most visible in the Debian kernel, but I was able to reproduce it in all available kernels (SUSE 2.6.34 and RHEL 2.6.18).

I found a single solution that stops the OOM killer from coming for innocent processes - disable memory overcommitment:

1) Set up swap as 50% of RAM or higher
2) Set vm.overcommit_memory = 2

Under these conditions only the Debian Lenny kernel still misbehaves (forget it and throw it away); all other kernels work fine: they NEVER reach an OOM state (but can still raise MemoryError in a "no memory" situation). If you disable the swap file, all overcommitted memory will be taken from real memory and cause a MemoryError state before real memory runs out.

On Fri, 2010-11-12 at 23:57 -0800, John Weekes wrote:
> On machines running many HVM (stubdom-based) domains, I often see errors
> like this:
>
> [77176.524094] qemu-dm invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> [... full OOM report and system details snipped; see the original post above ...]
>
> Anyone have any ideas on what this could be? Is RAM getting
> spontaneously filled because a buffer somewhere grows too quickly, or
> something like that? What can I try here?
>
> -John
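The settings suggested above map to standard sysctls; a minimal sketch of applying them (the values are this poster's suggestion, not a verified fix for the bug):

    # at runtime
    sysctl -w vm.overcommit_memory=2
    swapon -s                        # confirm swap is at least ~50% of RAM
    # to persist, in /etc/sysctl.conf:
    #   vm.overcommit_memory = 2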
On Sat, 2010-11-13 at 05:19 -0500, John Weekes wrote:
> With blktap2, is free RAM in dom0 still used for a disk cache at all? I
> have this dom0 set to 1.5 GB mainly to help with caching; if that RAM is
> not needed, I'll retool it down to a smaller number.

If you're not using cloned images deriving from a shared parent image, that caching won't buy anyone much. Memory is better spent on the guests themselves, and thereby their own caches.

Keep an eye on /proc/meminfo; it largely depends on the number/type of guests, but it's probably safe to reassign ~800M straight away.

blktap2 with aio will move the datapath to direct I/O. Compared to buffered loops, there's also some notable benefit to crash consistency resulting from that.

Daniel
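A rough sketch of what "keep an eye on /proc/meminfo and reassign memory" can look like in practice; the 768M target is purely illustrative, not a figure from the thread:

    # how much of dom0's 1.5G is actually page cache / dirty data right now?
    grep -E 'MemFree|Cached|Dirty|Writeback' /proc/meminfo
    # balloon dom0 down if there is clear slack (xm syntax for this era of Xen)
    xm mem-set Domain-0 768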
>>> On 13.11.10 at 10:13, Ian Pratt <Ian.Pratt@eu.citrix.com> wrote:
>>> What do the guests use for storage? (e.g. "blktap2 for VHD files on
>>> an iscsi mounted ext3 volume")
>>
>> Simple sparse .img files on a local ext4 RAID volume, using "file:".
>
> Ah, if you're using loop it may be that you're just filling memory with
> dirty pages. Older kernels certainly did this, not sure about newer ones.

Shouldn't this lead to the calling process being throttled, instead of the system running into OOM?

Further, having got reports of similar problems lately, too, we have indications that using pv drivers also gets us around the issue, which makes me think that it's rather qemu-dm misbehaving (and not getting stopped doing so by the kernel for whatever reason - possibly just missing some non-infinite rlimit setting).

Not knowing much about the workings of stubdom, one thing I don't really understand is how qemu-dm in Dom0 would be heavily resource consuming here (actually I would have expected no qemu-dm in Dom0 at all in this case). Aren't the main I/O paths going from qemu-stubdom directly to the backends?

Jan
On Mon, 2010-11-15 at 03:55 -0500, Jan Beulich wrote:
> Shouldn't this lead to the calling process being throttled, instead of
> the system running into OOM?

They are throttled, but the single control I'm aware of is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only per process, not a global limit. Could well be that's part of the problem -- outwitting mm with just too many writers on too many cores?

We had a bit of trouble when switching dom0 to 2.6.32; buffered writes made it much easier than with e.g. 2.6.27 to drive everybody else into costly reclaims.

The OOM shown here reports about ~650M in dirty pages. The fact alone that this counts as an OOM condition doesn't sound quite right in itself. That qemu might just have dared to ask at the wrong point in time.

Just to get an idea -- how many guests did this box carry?

Daniel
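For readers following along, these are the writeback knobs being discussed; a minimal sketch of inspecting and tightening them (the value 2 is what Daniel suggests later in the thread, and dirty_bytes is the absolute-size alternative on newer kernels):

    cat /proc/sys/vm/dirty_ratio              # 20 is the usual default on this kernel
    cat /proc/sys/vm/dirty_background_ratio   # threshold where background writeback starts
    sysctl -w vm.dirty_ratio=2                # force heavy writers to block much earlier
    # or an absolute cap instead of a percentage:
    # sysctl -w vm.dirty_bytes=67108864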
>>> On 15.11.10 at 10:40, Daniel Stodden <daniel.stodden@citrix.com> wrote:
> They are throttled, but the single control I'm aware of
> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
> per process, not a global limit. Could well be that's part of the
> problem -- outwitting mm with just too many writers on too many cores?
>
> The OOM shown here reports about ~650M in dirty pages. The fact alone
> that this counts as an OOM condition doesn't sound quite right in
> itself. That qemu might just have dared to ask at the wrong point in
> time.

Indeed - dirty pages alone shouldn't resolve to OOM.

> Just to get an idea -- how many guests did this box carry?

From what we know this requires just a single (Windows 7 or somesuch) guest, provided the guest has more memory than Dom0.

Jan
On Mon, 15 Nov 2010, Jan Beulich wrote:
> Not knowing much about the workings of stubdom, one thing I
> don't really understand is how qemu-dm in Dom0 would be
> heavily resource consuming here (actually I would have expected
> no qemu-dm in Dom0 at all in this case). Aren't the main I/O paths
> going from qemu-stubdom directly to the backends?

Qemu-dm in a stubdom uses the blkfront and netfront drivers in MiniOS to communicate with the backends in dom0. In a stubdom-only scenario, qemu-dm in dom0 only provides the xenfb backend for the vesa framebuffer.
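That split is visible from dom0's side; a small sketch of how one might confirm it on a running host (domain names and exact xenstore paths vary by setup, so treat this as illustrative):

    xm list                                   # the stub domain shows up separately, e.g. "guest1-dm"
    xenstore-ls /local/domain/0/backend/vbd   # disk backends, keyed by frontend domid
    xenstore-ls /local/domain/0/backend/vif   # network backends, likewise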
> They are throttled, but the single control I'm aware of
> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
> per process, not a global limit. Could well be that's part of the
> problem -- outwitting mm with just too many writers on too many cores?
>
> Just to get an idea -- how many guests did this box carry?

It carries about two dozen guests, with a mix of mostly HVMs (all stubdom-based, some with PV-on-HVM drivers) and some PV.

This problem occurred more often for me under 2.6.32 than 2.6.31, I noticed. Since I made the switch to aio, I haven't seen a crash, but it hasn't been long enough for that to mean much.

Having extra caching in the dom0 is nice because it allows domUs to get away with having small amounts of free memory, while still having very good (much faster than hardware) write performance. If you have a large number of domUs that are all memory-constrained and use the disk in infrequent, large bursts, this can work out pretty well, since the big communal pool provides a better value proposition than giving each domU a few more megabytes of RAM.

If the OOM problem isn't something that can be fixed, it might be a good idea to print a warning to the user when a domain using "file:" is started. Or, to go a step further, automatically run "file"-based domains as though "aio" was specified, possibly with a warning and a way to override that behavior. It's not really intuitive that "file:" would cause crashes.

-John
Performance is noticeably lower with aio on these bursty write workloads; I've been getting a number of complaints.

I see that 2.6.36 has some page_writeback changes: http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2Fv2.6%2Fpatch-2.6.36.bz2;z=8379 . Any thoughts on whether these would make a difference for the problems with "file:"? I'm still trying to find a way to reproduce the issue in the lab, so I'd have to test the patch in production -- that's not a tantalizing prospect, unless there is a real chance that it will affect it.

-John
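Since reliable reproduction is the sticking point, one hedged way to approximate the suspected trigger is a large guest write burst while dom0's dirty-page counters are watched; the paths and sizes below are made up for illustration:

    # inside a guest: generate a large, bursty write
    dd if=/dev/zero of=/tmp/burst.bin bs=1M count=2048 conv=fdatasync
    # meanwhile, in dom0: watch dirty/writeback pages and free memory climb and drain
    watch -n1 "grep -E 'MemFree|Dirty|Writeback' /proc/meminfo"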
> Performance is noticeably lower with aio on these bursty write
> workloads; I've been getting a number of complaints.

That's the cost of having guest data safely committed to disk before being ACK'ed. The users will presumably be happier when a host failure doesn't trash their filesystems due to the total loss of any of the write ordering the filesystem implementer intended.

Personally, I wouldn't want any data of mine stored on such a system, but I guess others' mileage may vary.

If unsafe write buffering is desired, I'd be inclined to implement it explicitly in tapdisk rather than rely on the total vagaries of the linux buffer cache. It would thus be possible to bound the amount of outstanding data, continue to respect ordering, and still respect explicit flushes.

Ian
There is certainly a trade-off, and historically, we've had problems with stability under Xen, so crashes are definitely a concern.

Implementation in tapdisk would be great.

I found today that tapdisk2 (at least on the latest 4.0-testing/unstable and latest pv_ops) is causing data corruption for Windows guests; I can see this by copying a few thousand files to another folder inside the guest, totalling a bit more than a GB, then running "fc" to check for differences (I tried with and without GPLPV). That's obviously a huge deal in production (and an even bigger deal than crashes), so in the short term, I may have to switch back to the uglier, crashier file: setup. I've been trying to find a workaround for the corruption all day without much luck.

-John
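For what it's worth, the same kind of integrity check can be scripted on a Linux guest (or against two mounted copies of an image); the paths below are hypothetical:

    (cd /data/original && find . -type f -exec md5sum {} + | sort -k2) > /tmp/orig.md5
    (cd /data/copy && find . -type f -exec md5sum {} + | sort -k2) > /tmp/copy.md5
    diff /tmp/orig.md5 /tmp/copy.md5    # any output indicates corrupted or missing files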
> I found today that tapdisk2 (at least on the latest 4.0-testing/unstable
> and latest pv_ops) is causing data corruption for Windows guests; I can
> see this by copying a few thousand files to another folder inside the
> guest, totalling a bit more than a GB, then running "fc" to check for
> differences (I tried with and without GPLPV). That's obviously a huge
> deal in production (and an even bigger deal than crashes), so in the
> short term, I may have to switch back to the uglier, crashier file:
> setup. I've been trying to find a workaround for the corruption all day
> without much luck.

That's disturbing. It might be worth trying to drop the number of VCPUs in dom0 to 1 and then try to repro.

BTW: for production use I'd currently be strongly inclined to use the XCP 2.6.32 kernel.

Ian
On Wed, 2010-11-17 at 17:02 -0500, John Weekes wrote:
> I found today that tapdisk2 (at least on the latest 4.0-testing/unstable
> and latest pv_ops) is causing data corruption for Windows guests; [...]
> I've been trying to find a workaround for the corruption all day
> without much luck.

Which branch/revision does latest pvops mean?

Would you be willing to try and reproduce that again with the XCP blktap (userspace, not kernel) sources? Just to further isolate the problem. Those see a lot of testing. I certainly can't recall a single fix to the aio layer in ages. But I'm never sure about other stuff potentially broken in userland.

If dio is definitely not what you feel you need, let's get back to your original OOM problem. Did reducing dom0 vcpus help? 24 of them is quite aggressive, to say the least.

If that alone doesn't help, I'd definitely try and check vm.dirty_ratio. There must be a tradeoff which doesn't imply scribbling the better half of 1.5GB of main memory.

Daniel
Daniel:

> Which branch/revision does latest pvops mean?

stable-2.6.32, using the latest pull as of today. (I also tried next-2.6.37, but it wouldn't boot for me.)

> Would you be willing to try and reproduce that again with the XCP blktap
> (userspace, not kernel) sources? Just to further isolate the problem.
> Those see a lot of testing. I certainly can't recall a single fix to the
> aio layer in ages. But I'm never sure about other stuff potentially
> broken in userland.

I'll have to give it a try. Normal blktap still isn't working with pv_ops, though, so I hope this is a drop-in for blktap2.

In my last bit of troubleshooting, I took O_DIRECT out of the open call in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates that this might have eliminated the problem with corruption. I'm testing further now, but could there be an issue with alignment (since the kernel is apparently very strict about it with direct I/O)? (Removing this flag also brings back use of the page cache, of course.)

> If dio is definitely not what you feel you need, let's get back to your
> original OOM problem. Did reducing dom0 vcpus help? 24 of them is quite
> aggressive, to say the least.

When I switched to aio, I reduced the vcpus to 2 (I needed to do this with dom0_max_vcpus, rather than through xend-config.sxp -- the latter wouldn't always boot). I haven't separately tried cached I/O with reduced CPUs yet, except in the lab; and unfortunately I still can't get the problem to happen in the lab, no matter what I try.

> If that alone doesn't help, I'd definitely try and check vm.dirty_ratio.
> There must be a tradeoff which doesn't imply scribbling the better half
> of 1.5GB of main memory.

The default for dirty_ratio is 20. I tried halving that to 10, but it didn't help. I could try lower, but I like the thought of keeping this in user space, if possible, so I've been pursuing the blktap2 path most aggressively.

Ian:

> That's disturbing. It might be worth trying to drop the number of VCPUs
> in dom0 to 1 and then try to repro.
> BTW: for production use I'd currently be strongly inclined to use the
> XCP 2.6.32 kernel.

Interesting, ok.

-John
On Wed, 2010-11-17 at 22:29 -0500, John Weekes wrote:

> Daniel:
>
> > Which branch/revision does latest pvops mean?
>
> stable-2.6.32, using the latest pull as of today. (I also tried next-2.6.37, but it wouldn't boot for me.)
>
> > Would you be willing to try and reproduce that again with the XCP blktap (userspace, not kernel) sources? Just to further isolate the problem. Those see a lot of testing. I certainly can't come up with a single fix to the aio layer, in ages. But I'm never sure about other stuff potentially broken in userland.
>
> I'll have to give it a try. Normal blktap still isn't working with pv_ops, though, so I hope this is a drop-in for blktap2.

I think it should work fine, or I wouldn't ask. If not, lemme know.

> In my last bit of troubleshooting, I took O_DIRECT out of the open call in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates that this might have eliminated the problem with corruption. I'm testing further now, but could there be an issue with alignment (since the kernel is apparently very strict about it with direct I/O)?

Nope. It is, but they're 4k-aligned all over the place. You'd see syslog yelling quite miserably in cases like that. Keeping an eye on syslog (the daemon and kern facilities) is a generally good idea, btw.

> (Removing this flag also brings the page cache back into use, of course.)

I/O-wise it's not much different from the file:-path. Meaning it should have carried you directly back into the OOM realm.

> > If dio is definitely not what you feel you need, let's get back to your original OOM problem. Did reducing dom0 vcpus help? 24 of them is quite aggressive, to say the least.
>
> When I switched to aio, I reduced the vcpus to 2 (I needed to do this with dom0_max_vcpus, rather than through xend-config.sxp -- the latter wouldn't always boot). I haven't separately tried cached I/O with reduced CPUs yet, except in the lab; and unfortunately I still can't get the problem to happen in the lab, no matter what I try.

Just reducing the cpu count alone sounds like something worth trying even on a production box, if the current state of things already tends to take the system down. Also, the dirty_ratio sysctl should be pretty safe to tweak at runtime.

> > If that alone doesn't help, I'd definitely try and check vm.dirty_ratio. There must be a tradeoff which doesn't imply scribbling the better half of 1.5GB main memory.
>
> The default for dirty_ratio is 20. I tried halving that to 10, but it didn't help.

Still too much. That's meant to be %/task. Try 2; with 1.5G that's still a decent 30M write cache, and should block all out of 24 disks after some 700M, worst case. Or so I think...

> I could try lower, but I like the thought of keeping this in user space, if possible, so I've been pursuing the blktap2 path most aggressively.

Okay. I'm sending you a tbz to try.

Daniel

> Ian:
>
> > That's disturbing. It might be worth trying to drop the number of VCPUs in dom0 to 1 and then try to repro.
> > BTW: for production use I'd currently be strongly inclined to use the XCP 2.6.32 kernel.
>
> Interesting, ok.
>
> -John
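Daniel's runtime tweak normally amounts to a one-liner such as "sysctl -w vm.dirty_ratio=2" (or a redirect into /proc/sys/vm/dirty_ratio). Purely as an illustration of what that touches, here is a minimal C sketch that reads the current value and lowers it to the 2 suggested above; it needs root, and the value 2 and the default of 20 are the only numbers taken from the thread.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path = "/proc/sys/vm/dirty_ratio";
    FILE *f;
    int cur = -1;

    /* Read the current setting (the thread reports a default of 20). */
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%d", &cur) == 1)
            printf("current vm.dirty_ratio: %d\n", cur);
        fclose(f);
    }

    /* Lower it to 2, as suggested above.  Takes effect immediately. */
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return EXIT_FAILURE;
    }
    fprintf(f, "2\n");
    fclose(f);
    return EXIT_SUCCESS;
}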
> I think [XCP blktap] should work fine, or I wouldn't ask. If not, lemme know.

k.

>> In my last bit of troubleshooting, I took O_DIRECT out of the open call in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates that this might have eliminated the problem with corruption. I'm testing further now, but could there be an issue with alignment (since the kernel is apparently very strict about it with direct I/O)?
>
> Nope. It is, but they're 4k-aligned all over the place. You'd see syslog yelling quite miserably in cases like that. Keeping an eye on syslog (the daemon and kern facilities) is a generally good idea, btw.

I've been doing that and haven't seen any unusual output so far, which I guess is good.

>> (Removing this flag also brings the page cache back into use, of course.)
>
> I/O-wise it's not much different from the file:-path. Meaning it should have carried you directly back into the OOM realm.

Does it make a difference that it's not using "loop", and that instead the CPU usage (and presumably some blocking) occurs in user-space? There's not too much information on this out there, but it seems as though the OOM issue might be at least somewhat loop device-specific. One document that references loop OOM problems is this one: http://sources.redhat.com/lvm2/wiki/DMLoop. My initial take on it was that it might be saying that it mattered when these things were being done in the kernel, but now I'm not so certain --

".. [their method and loop] submit[s] [I/O requests] via a kernel thread to the VFS layer using traditional I/O calls (read, write etc.). This has the advantage that it should work with any file system type supported by the Linux VFS (including networked file systems), but has some drawbacks that may affect performance and scalability. This is because it is hard to predict what a file system may attempt to do when an I/O request is submitted; for example, it may need to allocate memory to handle the request and the loopback driver has no control over this. Particularly under low-memory or intensive I/O scenarios this can lead to out of memory (OOM) problems or deadlocks as the kernel tries to make memory available to the VFS layer while satisfying a request from the block layer."

Would there be an advantage to using blktap/blktap2 over loop, if I leave off O_DIRECT? Would it be faster, or anything like that?

> Just reducing the cpu count alone sounds like something worth trying even on a production box, if the current state of things already tends to take the system down. Also, the dirty_ratio sysctl should be pretty safe to tweak at runtime.

That's good to hear.

>> The default for dirty_ratio is 20. I tried halving that to 10, but it didn't help.
>
> Still too much. That's meant to be %/task. Try 2; with 1.5G that's still a decent 30M write cache, and should block all out of 24 disks after some 700M, worst case. Or so I think...

Ah, ok. I was thinking that it was global. With a small per-process cache like that, it becomes much closer to AIO for writes, but at least the leftover memory could still be used for the read cache.

-John
On Thu, 2010-11-18 at 02:15 -0500, John Weekes wrote:

> > I think [XCP blktap] should work fine, or I wouldn't ask. If not, lemme know.
>
> k.
>
> >> In my last bit of troubleshooting, I took O_DIRECT out of the open call in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates that this might have eliminated the problem with corruption. I'm testing further now, but could there be an issue with alignment (since the kernel is apparently very strict about it with direct I/O)?
> >
> > Nope. It is, but they're 4k-aligned all over the place. You'd see syslog yelling quite miserably in cases like that. Keeping an eye on syslog (the daemon and kern facilities) is a generally good idea, btw.
>
> I've been doing that and haven't seen any unusual output so far, which I guess is good.
>
> >> (Removing this flag also brings the page cache back into use, of course.)
> >
> > I/O-wise it's not much different from the file:-path. Meaning it should have carried you directly back into the OOM realm.
>
> Does it make a difference that it's not using "loop", and that instead the CPU usage (and presumably some blocking) occurs in user-space?

It's certainly a different path taken. I just meant to say file access has about the same properties, so you're likely back to the original issue.

> There's not too much information on this out there, but it seems as though the OOM issue might be at least somewhat loop device-specific. One document that references loop OOM problems is this one: http://sources.redhat.com/lvm2/wiki/DMLoop. My initial take on it was that it might be saying that it mattered when these things were being done in the kernel, but now I'm not so certain --
>
> ".. [their method and loop] submit[s] [I/O requests] via a kernel thread to the VFS layer using traditional I/O calls (read, write etc.). This has the advantage that it should work with any file system type supported by the Linux VFS (including networked file systems), but has some drawbacks that may affect performance and scalability. This is because it is hard to predict what a file system may attempt to do when an I/O request is submitted; for example, it may need to allocate memory to handle the request and the loopback driver has no control over this. Particularly under low-memory or intensive I/O scenarios this can lead to out of memory (OOM) problems or deadlocks as the kernel tries to make memory available to the VFS layer while satisfying a request from the block layer."
>
> Would there be an advantage to using blktap/blktap2 over loop, if I leave off O_DIRECT? Would it be faster, or anything like that?

No, it's essentially the same thing. Both blktap and loopdevs sit on the vfs in a similar fashion, without O_DIRECT even more so. The deadlocking and OOM hazards are also the same, btw.

Deadlocks are a fairly general problem whenever you layer two subsystems depending on the same resource on top of each other. Both in the blktap and loopback case the system has several opportunities to hang itself, because there's even more stuff stacked than normal. The layers are, top to bottom:

(1) Potential caching of {tap/loop}dev writes (Xen doesn't do that).
(2) The block device, which needs some minimum amount of memory to run its request queue.
(3) Cached writes on the file layer.
(4) The filesystem, which needs memory to launder those pages.
(5) The disk's block device, equivalent to 2.
(6) The device driver running the data transfers.

The shared resource is memory. Now consider what happens when the upper layers in combination grab everything the lower layers need to make progress. The upper layers can't roll back, so they won't let go of their memory before that has happened. So we're stuck.

It shouldn't happen; the kernel has a bunch of mechanisms to prevent that. It obviously doesn't quite work here.

That's why I'm suggesting that the most obvious fix for your case is to limit the cache dirtying rate.

> > Just reducing the cpu count alone sounds like something worth trying even on a production box, if the current state of things already tends to take the system down. Also, the dirty_ratio sysctl should be pretty safe to tweak at runtime.
>
> That's good to hear.
>
> >> The default for dirty_ratio is 20. I tried halving that to 10, but it didn't help.
> >
> > Still too much. That's meant to be %/task. Try 2; with 1.5G that's still a decent 30M write cache, and should block all out of 24 disks after some 700M, worst case. Or so I think...
>
> Ah, ok. I was thinking that it was global. With a small per-process cache like that, it becomes much closer to AIO for writes, but at least the leftover memory could still be used for the read cache.

I agree it doesn't do what you want. I have no idea why there's no global limit, seriously.

Note that in theory, 24*2% would still approach the OOM state you were in with the log you sent. I think it's going to be less likely, though. With all guests going mad at the same time, it may still not be low enough. In case that happens, you could resort to pumping even more memory into dom0.

Daniel
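To make Daniel's estimate concrete, here is a minimal sketch of the arithmetic behind the "30M per task, roughly 700M worst case" figures. It uses his per-task reading of dirty_ratio and the 1.5G dom0 / 24 disk backends mentioned in the thread; the interpretation is his, not a statement about exact kernel semantics.

#include <stdio.h>

int main(void)
{
    const double dom0_mem_mb = 1536.0;  /* ~1.5 GB dom0, per the thread */
    const int    writers     = 24;      /* one backend per guest disk */
    const int    dirty_ratio = 2;       /* the value suggested above */

    /* Per-task dirty limit: ratio% of dom0 memory (~30 MB here). */
    double per_task_mb = dom0_mem_mb * dirty_ratio / 100.0;

    /* Worst case: every writer dirties its full allowance at once
     * (~700 MB here, matching Daniel's estimate). */
    double worst_case_mb = per_task_mb * writers;

    printf("per-task dirty limit : ~%.0f MB\n", per_task_mb);
    printf("worst case, %d tasks : ~%.0f MB\n", writers, worst_case_mb);
    return 0;
}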
Daniel, thank you for the help and in-depth information, as well as the test code off-list.

The corruption problem with blktap2 O_DIRECT is easily reproducible for me on multiple machines, so I hope that we'll be able to nail this one down pretty quickly.

To follow up on my question about the potential performance difference between blktap2 without O_DIRECT and loop (both of which use the page cache), I did some tests inside a sparse file-backed domU by timing the copy of a folder containing 7419 files and folders totalling 1.6 GB (of mixed sizes), and found that loop returned this:

real    1m18.257s
user    0m0.050s
sys     0m6.550s

While tapdisk2 aio without O_DIRECT clocked in at:

real    0m55.373s
user    0m0.050s
sys     0m6.690s

With each, I saw a few more seconds of disk activity on dom0 afterwards, since dirty_ratio was set to 2. I ran the tests several times and dropped caches on dom0 between each one; all of the results were within a second or two of each other.

This represents a significant ~41% performance bump for that particular workload.

In light of this, I would recommend that anyone who is using "file:" try out tapdisk2 aio with a modified block-aio.c that removes O_DIRECT, and see how it goes. If you find results similar to mine, it might be worth turning this into another blktap2 driver.

-John
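John's note about dropping caches on dom0 between runs is presumably the usual sync-then-drop_caches step (the shell equivalent being "sync; echo 3 > /proc/sys/vm/drop_caches"). The sketch below just spells that out in C as an assumed reconstruction of the methodology, not something taken from his setup; it needs root.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    FILE *f;

    sync();  /* write back dirty pages so they become droppable */

    f = fopen("/proc/sys/vm/drop_caches", "w");
    if (!f) {
        perror("/proc/sys/vm/drop_caches");
        return EXIT_FAILURE;
    }
    fputs("3\n", f);  /* 1 = page cache, 2 = dentries/inodes, 3 = both */
    fclose(f);
    return EXIT_SUCCESS;
}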