Do any of the Windows PV drivers make use of tmem in any way? I'm exploring how I could integrate this into GPLPV... so far I'm thinking of a filesystem filter that detects pagefile reads and writes and redirects them to tmem where appropriate.

Thanks

James
> Do any of the Windows PV drivers make use of tmem in any way? I'm exploring
> how I could integrate this into GPLPV... so far I'm thinking of a filesystem
> filter that detects pagefile reads and writes and redirects them to tmem
> where appropriate.

I'd mulled it over a while ago for the Citrix PV drivers but failed to really find a use case. I agree that pagefile *would* be a good use case, but to detect what was pagefile I suspect we'd need a filter driver somewhere fairly high up the storage stack, and I didn't really fancy getting into that at the time.

Paul
> I'd mulled it over a while ago for the Citrix PV drivers but failed to
> really find a use case. I agree that pagefile *would* be a good use case,
> but to detect what was pagefile I suspect we'd need a filter driver
> somewhere fairly high up the storage stack, and I didn't really fancy
> getting into that at the time.

I've just been dipping my toe in the Windows fs filters and am finding them quite simple to deal with, so a straightforward implementation could be very easy. Detecting whether the file is a pagefile is easy enough; there is a call to query that (although it has some IRQL and errata limitations).

The one thing I'm not sure about is whether Windows provides a way to know that the page is done with. In theory, I would divert a write to tmem, then divert the subsequent read from tmem, and the read would signify that the page of memory could be removed from tmem, but that doesn't necessarily follow. I don't have solutions for the following:

- Differentiating pagefile metadata writes from actual page writes. E.g. if the swapfile is grown on the fly (Windows does this) then Windows would presumably update the signature of the swapfile, and this signature is likely undocumented. I suspect actual page writes will have particular buffer characteristics, so maybe this isn't going to be too difficult.

- Windows could optimistically write a page out during periods of idle IO, even though the page is still in use, and then later throw the memory page out if required; but the application the memory belongs to could be unloaded before then, so there may be a write but never a read.

- Windows could read a page back into memory, later throw the memory page out, and then still expect the pagefile to hold the data. I imagine this sort of behaviour is documented nowhere, and even if I prove that a write is always later followed by exactly one read, this isn't guaranteed for the future.

A less risky implementation would just use tmem as a write-through cache and then throw out old pages on an LRU basis or something. Or discard the pages from tmem on read but write them back to disk. It kind of sucks the usefulness out of it, though, if you can't avoid the writes, and if Windows is doing some trickery to page out during periods of low IO then I'd be upsetting that too.

Anyway, I have written a skeleton fs filter so I can monitor what is going on in better detail when I get a few minutes. Later versions of Windows might make use of discard (trim/unmap), which would solve most of the above problems.

There do seem to be some (Windows equivalent of) page cache operations that could be hooked too... or else the API callback naming is leading me astray.

Thanks

James
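To make the write-through variant concrete, here is a minimal sketch of the pre-write path, with the caveats that TmemPutPage() is a hypothetical GPLPV helper wrapping a tmem put (not an existing function), and that the pagefile test shown (FsRtlIsPagingFile) is presumably the query call with the IRQL/errata limitations mentioned above:

/*
 * Sketch only: mirror each 4k page of a pagefile write into tmem, keyed by
 * pagefile page index, then let the write continue to disk (write-through).
 * TmemPutPage() is hypothetical; the rest is the standard filter manager API.
 */
#include <fltKernel.h>

BOOLEAN TmemPutPage(ULONGLONG PageIndex, PVOID PageVa); /* hypothetical GPLPV helper */

FLT_PREOP_CALLBACK_STATUS
PagefilePreWrite(PFLT_CALLBACK_DATA Data,
                 PCFLT_RELATED_OBJECTS FltObjects,
                 PVOID *CompletionContext)
{
    LARGE_INTEGER offset;
    ULONG length;
    ULONG i;
    PMDL mdl;
    PVOID buffer;

    UNREFERENCED_PARAMETER(CompletionContext);

    /* Only writes to the pagefile itself are interesting. */
    if (!FsRtlIsPagingFile(FltObjects->FileObject))
        return FLT_PREOP_SUCCESS_NO_CALLBACK;

    offset = Data->Iopb->Parameters.Write.ByteOffset;
    length = Data->Iopb->Parameters.Write.Length;
    mdl = Data->Iopb->Parameters.Write.MdlAddress;
    buffer = (mdl != NULL)
        ? MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority)
        : NULL;

    if (buffer != NULL) {
        /* Mirror each 4k page into tmem, keyed by pagefile page index,
         * so that later 4k reads can be matched individually. */
        for (i = 0; i + PAGE_SIZE <= length; i += PAGE_SIZE) {
            TmemPutPage((ULONGLONG)(offset.QuadPart + i) / PAGE_SIZE,
                        (PUCHAR)buffer + i);
        }
    }

    /* Write-through: always let the write continue down to the disk. */
    return FLT_PREOP_SUCCESS_NO_CALLBACK;
}

Because the cache is write-through, a tmem page that the hypervisor later drops is harmless: the real write always reaches the disk anyway.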
> A less risky implementation would just use tmem as a write-through cache
> and then throw out old pages on an LRU basis or something. Or discard the
> pages from tmem on read but write them back to disk. It kind of sucks the
> usefulness out of it, though, if you can't avoid the writes, and if Windows
> is doing some trickery to page out during periods of low IO then I'd be
> upsetting that too.

This sounds a lot less fragile and the saving on reads to the storage backend could still be significant.

> Anyway, I have written a skeleton fs filter so I can monitor what is going
> on in better detail when I get a few minutes. Later versions of Windows
> might make use of discard (trim/unmap), which would solve most of the above
> problems.
>
> There do seem to be some (Windows equivalent of) page cache operations that
> could be hooked too... or else the API callback naming is leading me astray.

Sounds interesting.
Presumably, if you can reliably intercept all IO on a pagefile, you could use tmem as a write-back cache in front of doing your own file I/O down the storage stack, as long as you could reliably flush it out when necessary. E.g. does Windows assume anything about the pagefile content on resume from S3 or S4?

Paul
> This sounds a lot less fragile and the saving on reads to the storage
> backend could still be significant.

Should be easy enough to test, I guess.

> Sounds interesting. Presumably, if you can reliably intercept all IO on a
> pagefile, you could use tmem as a write-back cache in front of doing your
> own file I/O down the storage stack, as long as you could reliably flush it
> out when necessary. E.g. does Windows assume anything about the pagefile
> content on resume from S3 or S4?

I need to look up whether an FS filter is notified about power state transitions. There may be a FLUSH of some sort that happens at that time. Newer versions of Windows have a thing called 'hybrid sleep', where the hibernate file is written out as if Windows were about to hibernate, but the machine then goes to sleep instead; if power is lost, a resume is still possible. It may be acceptable to say that tmem = no hibernate. Migration should be easy enough, as I have direct control over that and can make tmem be written back out to the pagefile first.

This all assumes that write-back is possible too...

James
On Tue, May 28, 2013 at 09:53:23AM +0000, James Harper wrote:
> It may be acceptable to say that tmem = no hibernate. Migration should be
> easy enough, as I have direct control over that and can make tmem be
> written back out to the pagefile first.
>
> This all assumes that write-back is possible too...

I am not familiar with the Windows APIs, but it sounds like you want to use the tmem ephemeral disk cache as a secondary cache (which is BTW what Linux does too).

That is OK; the only thing you need to keep in mind is that the hypervisor might flush said cache out if it decides to do so (say a new guest is launched and it needs the memory that said cache is using).

So the tmem_get might tell you that it does not have the page any more.
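In code terms, the "best effort" contract described here boils down to something like the following sketch; tmem_get_page() and disk_read_page() are hypothetical wrapper names (returning 0 on success), not existing GPLPV or Xen functions:

/* Hypothetical wrappers around the tmem get and the normal disk path. */
int tmem_get_page(unsigned long long index, void *page);
int disk_read_page(unsigned long long index, void *page);

/* Read one page of the pagefile: try the ephemeral cache first and treat
 * any miss as normal, since the hypervisor may reclaim the pool at will. */
int pagefile_read_page(unsigned long long index, void *page)
{
    if (tmem_get_page(index, page) == 0)
        return 0; /* hit */

    /* Miss: the disk copy is authoritative (write-through), so fall back. */
    return disk_read_page(index, page);
}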
On Tue, May 28, 2013 at 10:17:00AM -0400, Konrad Rzeszutek Wilk wrote:
> I am not familiar with the Windows APIs, but it sounds like you want to use
> the tmem ephemeral disk cache as a secondary cache (which is BTW what Linux
> does too).

Oh and I should mention that I would be more than thrilled to try this out and see how it works.
> That is OK; the only thing you need to keep in mind is that the hypervisor
> might flush said cache out if it decides to do so (say a new guest is
> launched and it needs the memory that said cache is using).
>
> So the tmem_get might tell you that it does not have the page any more.

Yes, I've read the brief :)

I actually wanted to implement the equivalent of 'frontswap' originally, by trapping writes to the pagefile. A bit of digging and testing suggests it may not be possible to determine when a page written to the pagefile is discarded, meaning that tmem use would just grow until full and then stop being useful unless I eject pages on an LRU basis or something. So ephemeral tmem as a best-effort write-through cache might be the best and easiest starting point.

James
On Wed, May 29, 2013 at 12:19:25AM +0000, James Harper wrote:
> I actually wanted to implement the equivalent of 'frontswap' originally, by
> trapping writes to the pagefile. A bit of digging and testing suggests it
> may not be possible to determine when a page written to the pagefile is
> discarded, meaning that tmem use would just grow until full and then stop
> being useful unless I eject pages on an LRU basis or something. So
> ephemeral tmem as a best-effort write-through cache might be the best and
> easiest starting point.

<nods>
> > So ephemeral tmem as a best-effort write-through cache might be the best
> > and easiest starting point.
>
> <nods>

Unfortunately it gets worse... I'm testing on Windows 2003 at the moment, and it seems to always write out data in 64k chunks, aligned to a 4k boundary. Then it reads in one or more of those pages, and maybe later re-uses the same part of the swapfile for something else. It seems that all reads are 4k in size, but there may be some grouping of those requests at a lower layer.

So I would end up caching up to 16x the actual data, with no way of knowing which of those 16 pages are actually being swapped out and which are just optimistically being written to disk without actually being paged out.

I'll do a bit of analysis of the MDL being written, as that may give me some more information, but it's not looking as good as I'd hoped.

James
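For completeness, the read side of the write-through scheme would key on the 4k pagefile page index so that these 4k reads can match pages put during the 64k writes. A rough sketch, with TmemGetPage() as the hypothetical counterpart of the put helper in the earlier write-path sketch:

/*
 * Sketch only: try to satisfy a single-page pagefile read from tmem and
 * complete it in the pre-read callback; on a miss, pass the read down
 * unchanged. TmemGetPage() is hypothetical.
 */
#include <fltKernel.h>

BOOLEAN TmemGetPage(ULONGLONG PageIndex, PVOID PageVa); /* hypothetical GPLPV helper */

FLT_PREOP_CALLBACK_STATUS
PagefilePreRead(PFLT_CALLBACK_DATA Data,
                PCFLT_RELATED_OBJECTS FltObjects,
                PVOID *CompletionContext)
{
    LARGE_INTEGER offset;
    ULONG length;
    PMDL mdl;
    PVOID buffer;

    UNREFERENCED_PARAMETER(CompletionContext);

    if (!FsRtlIsPagingFile(FltObjects->FileObject))
        return FLT_PREOP_SUCCESS_NO_CALLBACK;

    offset = Data->Iopb->Parameters.Read.ByteOffset;
    length = Data->Iopb->Parameters.Read.Length;
    mdl = Data->Iopb->Parameters.Read.MdlAddress;

    /* Pagefile reads appear to arrive one 4k page at a time. */
    if (length != PAGE_SIZE || mdl == NULL)
        return FLT_PREOP_SUCCESS_NO_CALLBACK;

    buffer = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
    if (buffer == NULL)
        return FLT_PREOP_SUCCESS_NO_CALLBACK;

    if (TmemGetPage((ULONGLONG)offset.QuadPart / PAGE_SIZE, buffer)) {
        /* Hit: complete the read here without touching the disk. */
        Data->IoStatus.Status = STATUS_SUCCESS;
        Data->IoStatus.Information = length;
        return FLT_PREOP_COMPLETE;
    }

    /* Miss (e.g. the hypervisor reclaimed the ephemeral pool): fall
     * through to the real pagefile on disk. */
    return FLT_PREOP_SUCCESS_NO_CALLBACK;
}

A miss here is always safe because the write-through put never stopped the original 64k write from reaching the disk.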
> So I would end up caching up to 16x the actual data, with no way of knowing
> which of those 16 pages are actually being swapped out and which are just
> optimistically being written to disk without actually being paged out.
>
> I'll do a bit of analysis of the MDL being written, as that may give me
> some more information, but it's not looking as good as I'd hoped.

I now have a working implementation that does write-through caching of pagefile writes to ephemeral tmem. It keeps some counters on get and put operations, and on a Windows 2003 server with 256MB memory assigned, after a bit of running and flipping between applications I get:

put_success_count = 96565
put_fail_count = 0
get_success_count = 34514
get_fail_count = 5369

which is somewhere around an 85% hit rate vs misses. That seems pretty good, except that Windows is quite aggressive about paging out, so there are a lot of unused writes (and therefore tmem usage) and I'm not sure if it's a net win.

Subjectively, Windows does seem faster with my driver active, but I'm using it over a crappy ADSL connection so it's hard to measure in any precise way.

I'm trying to see if it is possible to use tmem as a cache for the Windows page cache, which would be more useful, but I'm not yet sure whether the required hooks exist in fs minifilters.

James
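(For reference, the hit rate above follows from the counters as get_success_count / (get_success_count + get_fail_count) = 34514 / (34514 + 5369) = 34514 / 39883 ≈ 0.865, i.e. roughly 85-87% of pagefile reads were served from tmem.)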
> I now have a working implementation that does write-through caching of
> pagefile writes to ephemeral tmem. It keeps some counters on get and put
> operations, and on a Windows 2003 server with 256MB memory assigned, after
> a bit of running and flipping between applications I get:
>
> put_success_count = 96565
> put_fail_count = 0
> get_success_count = 34514
> get_fail_count = 5369
>
> which is somewhere around an 85% hit rate vs misses.

Do you have any numbers for a more recent version of Windows? At least a 6.x kernel. Perhaps the pageout characteristics are better?

Paul
> Do you have any numbers for a more recent version of Windows? At least a
> 6.x kernel. Perhaps the pageout characteristics are better?

Building a 2008R2 server now. It's on an older AMD server (Dual-Core AMD Opteron(tm) Processor 1210) and seems to be installing really slowly...

James
> > Do you have any numbers for a more recent version of Windows? At least a
> > 6.x kernel. Perhaps the pageout characteristics are better?
>
> Building a 2008R2 server now. It's on an older AMD server (Dual-Core AMD
> Opteron(tm) Processor 1210) and seems to be installing really slowly...

[117301.344358] powernow-k8: fid trans failed, fid 0xa, curr 0x0
[117301.344417] powernow-k8: transition frequency failed
[117301.348266] powernow-k8: fid trans failed, fid 0x2, curr 0x0
[117301.348326] powernow-k8: transition frequency failed

I suspect that's the reason... I normally disable the powernow-k8 module but forgot after the latest update.

James
> Do you have any numbers for a more recent version of Windows? At least a
> 6.x kernel. Perhaps the pageout characteristics are better?

Fresh install of 2008R2 with 512MB of memory and tmem active, with updates installing for the last 30 minutes:

put_success_count = 1286906
put_fail_count = 0
get_success_count = 511937
get_fail_count = 286789

A 'get fail' is a 'miss'.

James
> Fresh install of 2008R2 with 512MB of memory and tmem active, with updates
> installing for the last 30 minutes:
>
> put_success_count = 1286906
> put_fail_count = 0
> get_success_count = 511937
> get_fail_count = 286789
>
> A 'get fail' is a 'miss'.

Hmm. On the face of it a much higher miss rate than 2K3, but the workload is different so it's hard to tell how comparable the numbers are. I wonder whether use of ephemeral tmem is an issue because of the get-implies-flush characteristic. I guess you'd always expect a put between gets for a pagefile but it might be interesting to see what miss rate you get with persistent tmem.

Paul
> > Fresh install of 2008R2 with 512MB of memory and tmem active, with
> > updates installing for the last 30 minutes:
> >
> > put_success_count = 1286906
> > put_fail_count = 0
> > get_success_count = 511937
> > get_fail_count = 286789
> >
> > A 'get fail' is a 'miss'.
>
> Hmm. On the face of it a much higher miss rate than 2K3, but the workload
> is different so it's hard to tell how comparable the numbers are. I wonder
> whether use of ephemeral tmem is an issue because of the get-implies-flush
> characteristic. I guess you'd always expect a put between gets for a
> pagefile but it might be interesting to see what miss rate you get with
> persistent tmem.

After running for a while longer:

put_success_count = 15732240
put_fail_count = 0
get_success_count = 10330032
get_fail_count = 4460352

which is a similar hit rate of get_success vs get_fail (~70%), but a much better ratio of get_success to put_success. If ephemeral pages are discarded on read, then this tells me that around 66% of the pages I put into tmem were read back in, vs around 40% in my first sample.

For persistent tmem to work I'd need to know when Windows will not need the memory again, which is information I don't have access to, or alternatively maintain my own LRU structure. What I really need to know is when Windows discards a page from memory, but all I know so far is when it writes out a page of memory to disk, which only tells me that at some future time it might discard the page from memory.

I'm only testing this one VM on a physical machine, so Xen isn't trying to do any balancing of tmem pools against other VMs. Assigning 384MB (I said 512MB before but I was mistaken) to a Windows 2008R2 server isn't even close to a realistic scenario, and a bunch of VMs all competing for ephemeral tmem might mean that pages are mostly discarded before they need to be retrieved.

James
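(For reference, the figures above follow from the counters in the same way: hit rate = 10330032 / (10330032 + 4460352) ≈ 0.70, and the fraction of puts later read back = 10330032 / 15732240 ≈ 0.66, versus 511937 / 1286906 ≈ 0.40 for the earlier 2008R2 sample.)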