James Harper
2008-Dec-31 02:46 UTC
[Xen-devel] freezing when using GPLPV drivers (including Dom0)
I'm trying to resolve an issue in my GPLPV drivers that has come about in doing some restores using Backup Exec across the network.

The server running Backup Exec can be a DomU or a completely separate machine (connected via gigabit Ethernet).

When restoring a large file (30G Exchange mailbox store), everything locks up for a bit, long enough for ARP to time out and the TCP connection for the backup data to drop, failing the backup. This can happen anywhere from 500MB to 20G into the restore, but normally around the 2G mark.

Investigating is a bit tricky as even Dom0 is not usable - any command I type at a shell doesn't do anything until it unfreezes. When everything comes back, it all comes back at once. There are no messages in the kernel logs or the xen logs.

I suspect that the problem may be disk starvation, but I don't quite understand why the lockup lasts so long. I'm also not sure why I'm only seeing the problem when using my GPLPV drivers - one possibility is that the increased performance puts more load on the storage system.

Any suggestions?

Thanks

James

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Tian, Kevin
2008-Dec-31 02:53 UTC
[Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
Did the restore process finish even when you saw the freeze in the middle?

Thanks,
Kevin
James Harper
2008-Dec-31 02:55 UTC
[Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
> Did the restore process finish even when you saw the freeze in the middle?

No. The TCP connection is closed because of the delay (the ARP cache times out), which causes the restore to fail.

James
Tian, Kevin
2008-Dec-31 03:00 UTC
[Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
>From: James Harper
>Sent: Wednesday, December 31, 2008 10:56 AM
>
>No. The TCP connection is closed because of the delay (the ARP cache
>times out), which causes the restore to fail.

Then if you kill this Windows guest, does dom0 come back to normal? If yes, the freeze is possibly due to servicing Windows guest activity such as heavy disk I/O, as you guess. If not, it may indicate a hang condition within either dom0 or Xen; you could first find the hang point and then dig into the details. It would also be good to check both the dom0 and Xen dmesg for any warnings already reported.

Thanks,
Kevin
Tian, Kevin
2008-Dec-31 03:07 UTC
[Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
>From: James Harper
>Sent: Wednesday, December 31, 2008 10:46 AM
>
>I am suspecting that maybe the problem is disk starvation but I don't
>quite understand why the lockup happens for so long. I'm also not sure
>why I'm only seeing the problem when using my GPLPV drivers - one
>possibility is that the increased performance puts more load on the
>storage system.

Maybe you can check the cycles spent in the kernel thread/event handler on the backend driver side. I'm not sure whether heavy communication between be/fe could disturb the dom0 scheduler if care is not taken in the current design. E.g. the backend kernel thread may eat too many cycles before giving up, or your GPLPV fe driver may issue so many events that it breaks the be side... Just my two cents. :-)

Thanks,
Kevin
James Harper
2008-Dec-31 03:16 UTC
RE: [Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
> Maybe you can check cycles spent on kernel thread/event handler
> in backend driver side. I'm not sure whether heavy communication
> between be/fe could disturb dom0 scheduler if care is not taken in
> current design. E.g. back kernel thread may eat too many cycles
> before giving up, or your GPLPV fe driver may issue too many events
> to break be side...

I am running the restore again and monitoring using:
. xentop running in dom0
. arping to the DomU running from an external machine
. ping to Dom0 running from an external machine

With arping and ping running I have noticed that the freeze is not always long enough to cause the TCP connections to time out - I was only noticing the ones that were long enough.

During the freeze, xentop shows very low Dom0 and DomU CPU, arping stops receiving replies to the ARP requests, but the ping to Dom0 keeps going. The freeze that just occurred was not long enough for me to tell if the DomU xentop counters for network and disk were increasing or not. (xentop keeps running, lending weight to the freeze only concerning tasks that want to access the disk.)

Is there a way under Linux of monitoring disk queue length? I am using LVM on top of a low end HP 'Smart Array' (E200) running two RAID1 volumes using SATA disks.

Thanks

James
Tian, Kevin
2008-Dec-31 03:23 UTC
RE: [Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
>From: James Harper [mailto:james.harper@bendigoit.com.au]
>Sent: Wednesday, December 31, 2008 11:16 AM
>
>Is there a way under Linux of monitoring disk queue length? I am using
>LVM on top of a low end HP 'Smart Array' (E200) running two RAID1
>volumes using SATA disks.

'sar' could provide such info, IMO.

Thanks,
Kevin
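For reference, the queue-length question above can also be answered straight from /proc/diskstats, which is where tools such as sar and iostat get their numbers. The following is only a minimal sketch: the helper name in_flight_ios is made up for illustration, and it assumes the full 14-field line layout (major, minor, device name, then the counters), so shorter partition lines from older kernels are simply skipped.

```python
def in_flight_ios(diskstats_text):
    """Map device name -> number of I/Os currently in progress.

    Expects the text of /proc/diskstats. In the full-format layout the
    fields are: major, minor, name, then the per-device counters, of
    which the ninth (index 11 overall) is "I/Os currently in progress".
    """
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Skip short (old-style partition) or empty lines.
        if len(fields) >= 14:
            result[fields[2]] = int(fields[11])
    return result


def read_in_flight(path="/proc/diskstats"):
    """Read the live file and return the per-device in-flight counts."""
    with open(path) as f:
        return in_flight_ios(f.read())
```

Calling read_in_flight() once a second in dom0 during a restore would show whether requests are actually queued at the block layer while everything appears frozen.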
James Harper
2008-Dec-31 03:37 UTC
RE: [Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
> >Is there a way under Linux of monitoring disk queue length? I am using
> >LVM on top of a low end HP 'Smart Array' (E200) running two RAID1
> >volumes using SATA disks.
>
> 'sar' could provide such info, IMO.

iostat shows very, very low disk usage when things are frozen. I am finding that I can type 'sync' and things will unfreeze again... unfreezing before the sync completes. I haven't done this enough times to know if things would have unfrozen on their own though.

James
Dear Gentlemen

Suppose you need to know the overall load on the host in terms of CPU, bandwidth and disk I/O: not just per domU, but aggregated as well as split per domU and dom0. Xentop does not show aggregated totals, nor does it show percentages relative to the available resources, so for management it is kind of useless. The only tool that shows that information (somewhat) is graphical, the libvirt Virtual Machine Manager. Is there a text-mode tool to manage a node?

Federico
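One possible text-mode approach to the aggregation asked about above is to post-process xentop's batch output. The sketch below is hedged: it assumes the batch output has a header row whose first column is NAME and which contains a CPU(%) column, that domain names contain no spaces, and the function name cpu_percent_by_domain is invented for illustration.

```python
def cpu_percent_by_domain(xentop_text):
    """Return {domain name: CPU(%)} for the last sample in the text.

    The column position is taken from the header row rather than
    hard-coded. Each new header row starts a new sample, so only the
    most recent sample is kept.
    """
    usage = {}
    col = None
    for line in xentop_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            col = fields.index("CPU(%)")
            usage = {}  # a new sample begins here
            continue
        if col is not None and len(fields) > col:
            try:
                usage[fields[0]] = float(fields[col])
            except ValueError:
                pass  # not a per-domain data row
    return usage


def total_cpu_percent(xentop_text):
    """Sum CPU(%) over all domains (dom0 included) in the last sample."""
    return sum(cpu_percent_by_domain(xentop_text).values())
```

The input text could come from running xentop in batch mode for two iterations (for example "xentop -b -i 2"), since the first sample has no interval to compute CPU usage over. The same pattern should extend to the network and disk columns.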
Tian, Kevin
2008-Dec-31 04:18 UTC
RE: [Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
>From: James Harper [mailto:james.harper@bendigoit.com.au]
>Sent: Wednesday, December 31, 2008 11:37 AM
>
>iostat shows very very low disk usage when things are frozen. I am
>finding that I can type 'sync' and things will unfreeze again...
>unfreezing before the sync completes. I haven't done this enough times
>to know if things would have unfrozen on their own though.

That looks interesting. Now both CPU and disk utilization are low, but the system is unresponsive for an unknown time... Does the time in dom0 look sane? I guess you may have to check the behavior/statistics of the fe/be drivers in depth, e.g. event count per second, whether the kernel thread is woken effectively, how many requests are handled per event notification, etc., and then judge whether those stats are as expected.

Thanks,
Kevin
James Harper
2008-Dec-31 04:51 UTC
RE: [Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
> That looks interesting. Now both cpu/disk utilizations are low, but system
> is not responsive for unknown time... Does time in dom0 look sane? I
> guess you may have to check behavior/statistics of fe/be drivers in depth,
> e.g. event count/s, whether kernel thread is waken effectively, how many
> requests handled per event notification, etc. and then may judge whether
> those stats are expected.

I have written a script that does 'sync ; sleep 5' in a loop. My restore is now at 20G and still going. I'll follow up if it completes.

I'm not sure where to look for this problem though... When I use the qemu emulated devices instead of GPLPV, the restore runs to completion, but it also runs slower, so maybe the problem isn't the GPLPV drivers but more that the qemu drivers can't get the I/O load up high enough to see the problem.

As I said earlier in the thread, the system is using a HP E200 'Smart' array controller, with no battery backup, and 2 pairs of RAID1 arrays on SATA disks. Obviously not the highest performing setup ever. I have a 500G disk I can attach to one of the onboard SATA ports, but I'm not sure that that will actually prove anything either way.

One other thing I didn't mention - I am using sparse files as my disk images, using 'file:' under Xen. Again, not the highest performing configuration, but the restore process we are using needs to see disks at least as big as those that were backed up originally, and I just don't have 2TB of disk lying around! The data access is DomU -> blkback -> /dev/loopX -> file(sparse) -> filesystem(xfs) -> LVM -> E200... that's a lot of room for stuff to go wrong in, isn't it?

I could try switching to tap:aio but I don't think that my GPLPV drivers work in that configuration... maybe time to find out why :)

Thanks

James
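The 'sync ; sleep 5' workaround described above might look something like the following. This is only a sketch, not the script that was actually used: the function name sync_loop and its parameters are illustrative, and the sync and sleep callables are injectable purely so the loop can be exercised without touching the disk.

```python
import os
import time


def sync_loop(interval=5.0, iterations=None, sync=os.sync, sleep=time.sleep):
    """Flush dirty data to disk, wait, and repeat.

    Runs forever when iterations is None, otherwise stops after that
    many sync/sleep rounds. Returns the number of rounds completed.
    """
    done = 0
    while iterations is None or done < iterations:
        sync()           # same effect as running sync(1)
        sleep(interval)  # give the restore some time before the next flush
        done += 1
    return done
```

Running sync_loop() in dom0 for the duration of the restore keeps the amount of unwritten data small, which fits the observation that a manual 'sync' unfreezes the system.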
Ian Pratt
2008-Dec-31 09:21 UTC
RE: [Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
> One other thing I didn't mention - I am using sparse files as my disk
> images, using 'file:' under Xen. Again, not the highest performing
> configuration, but the restore process we are using needs to see disks
> at least as big as those that were backed up originally, and I just
> don't have 2TB of disk lying around! The data access is DomU -> blkback
> -> /dev/loopX -> file(sparse) -> filesystem(xfs) -> LVM -> E200...
> that's a lot of room for stuff to go wrong in isn't it?

Which loop driver are you using? The std loop driver is well known to deadlock under high write load. I think this may have been fixed with loop-ng, but you'd likely be better off using tap:aio.

Ian
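For readers comparing the two backends, the difference in the domain configuration is just the prefix on the disk line. The fragment below is a hedged illustration in the Python syntax used by xm config files; the image path and the hda device name are hypothetical and not taken from this thread.

```python
# Hypothetical image path and guest device name, for illustration only.
IMAGE = "/var/lib/xen/images/restore-disk.img"

# Loopback-backed: blkback -> /dev/loopX -> sparse file.
disk_file = ["file:%s,hda,w" % IMAGE]

# blktap-backed: the image is served by a userspace tapdisk process
# using Linux AIO, with no loop device in the path.
disk_tap = ["tap:aio:%s,hda,w" % IMAGE]

# The domain config assigns one of the two to `disk`.
disk = disk_tap
```

Only the prefix changes, so switching between the two for a test run does not require touching the image itself.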
James Harper
2008-Dec-31 10:21 UTC
RE: [Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
> Which loop driver are you using? The std loop driver is well known to
> deadlock under high write load. I think this may have been fixed with
> loop-ng, but you'd likely be better off using tap:aio.

I've never even heard of loop-ng... I just did a 'find' for any kernel module with 'loop' in the name and didn't see anything called 'loop-ng'... is it something I need to enable in the kernel config?

I just tried tap:aio but the DomU hung for ages after starting the restore... I'm just about to investigate.

Thanks

James
Ian Pratt
2008-Dec-31 10:41 UTC
RE: [Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
> I've never even heard of loop-ng... I just did a 'find' for any kernel
> module with 'loop' in the name and didn't see anything called
> 'loop-ng'... is it something I need to enable in the kernel config?

It's now part of device mapper, and called dm-loop. The key improvement is that it's supposed to avoid dirtying unbounded amounts of memory and then deadlocking. [However, this type of deadlock is fairly terminal -- I've never tried, but I don't think 'sync' would unwedge it, so you may have a different issue.]

> I just tried tap:aio but the DomU hung for ages after starting the
> restore... I'm just about to investigate.

Blktap certainly doesn't suffer from memory deadlock issues as it opens the file O_DIRECT.

BTW: To my mind we should switch over from blktap to blktap2 soon. Blktap2 isn't as mature yet, but it's more aesthetically pleasing and has equivalent performance.

Ian
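Whether dom0 really is accumulating dirty memory behind the loop device can be checked from /proc/meminfo while the restore runs. The sketch below is an illustration only; the function name dirty_and_writeback_kb is made up, and it relies on the Dirty and Writeback lines that /proc/meminfo reports in kB.

```python
def dirty_and_writeback_kb(meminfo_text):
    """Extract the Dirty and Writeback values (in kB) from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        # Match the exact keys only, so WritebackTmp and friends are ignored.
        if key in ("Dirty", "Writeback"):
            values[key] = int(rest.split()[0])
    return values


def read_dirty_and_writeback(path="/proc/meminfo"):
    """Read the live file and return the current Dirty/Writeback figures."""
    with open(path) as f:
        return dirty_and_writeback_kb(f.read())
```

If Dirty climbs steadily up to the point of a freeze and drops after a 'sync', that would point at page-cache writeback through the loop device rather than at the frontend drivers.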
Keir Fraser
2008-Dec-31 10:50 UTC
Re: [Xen-devel] RE: freezing when using GPLPV drivers (including Dom0)
On 31/12/2008 10:41, "Ian Pratt" <Ian.Pratt@eu.citrix.com> wrote:

> BTW: To my mind we should switch over from blktap to blktap2 soon.
> Blktap2 isn't as mature yet, but its more aesthetically pleasing and has
> equivalent performance.

There's currently discussion between Andy's team and Intel to get blktap2 working with Intel's test setup. When that works, blktap2 will be going into xen-unstable.

 -- Keir