All the resources allocated based on xen_blkif_reqs are global in
blkback. While (without having measured anything) I think that this
is bad from a QoS perspective (not the least implied from a warning
issued by Citrix's multi-page-ring patches:

	if (blkif_reqs < BLK_RING_SIZE(order))
		printk(KERN_WARNING "WARNING: "
		       "I/O request space (%d reqs) < ring order %ld, "
		       "consider increasing %s.reqs to >= %ld.",
		       blkif_reqs, order, KBUILD_MODNAME,
		       roundup_pow_of_two(BLK_RING_SIZE(order)));

indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
this to per-instance allocations, as the amount of memory taken
away from Dom0 for this may not be insignificant when there are
many devices.

Does anyone have an opinion here, in particular regarding the
original authors' decision to make this global vs. the observation
apparently made by Daniel Stodden (the author of said patch, for
whom I don't have a current email address to ask directly), but also
in the context of multi-page rings, the purpose of which is to
allow for larger amounts of in-flight I/O?

Thanks, Jan
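For readers without the source at hand: the global arrangement under
discussion looks roughly like the sketch below - a simplified paraphrase
of the xen-blkback driver of that era, not verbatim source. Every backend
instance draws its request-tracking structures from one shared free list,
sized once at module load from the xen_blkif_reqs parameter.

	#include <linux/module.h>
	#include <linux/list.h>
	#include <linux/spinlock.h>
	#include <linux/wait.h>

	/* Simplified paraphrase of blkback's global request bookkeeping. */
	static int xen_blkif_reqs = 64;		/* module parameter "reqs" */
	module_param_named(reqs, xen_blkif_reqs, int, 0);

	struct pending_req {			/* per-request tracking state */
		struct list_head free_list;
		/* ... grant handles, bio counts, etc. ... */
	};

	static struct pending_req *pending_reqs; /* one array for ALL backends */
	static LIST_HEAD(pending_free);		 /* free list shared by all    */
	static DEFINE_SPINLOCK(pending_free_lock);
	static DECLARE_WAIT_QUEUE_HEAD(pending_free_wq);

	static struct pending_req *alloc_req(void)
	{
		struct pending_req *req = NULL;
		unsigned long flags;

		spin_lock_irqsave(&pending_free_lock, flags);
		if (!list_empty(&pending_free)) {
			req = list_entry(pending_free.next,
					 struct pending_req, free_list);
			list_del(&req->free_list);
		}
		spin_unlock_irqrestore(&pending_free_lock, flags);
		return req;	/* NULL => the caller sleeps on pending_free_wq */
	}

Every vbd of every guest contends on pending_free_lock and competes for
the same xen_blkif_reqs entries, which is the QoS concern raised above.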
On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> All the resources allocated based on xen_blkif_reqs are global in
> blkback. While (without having measured anything) I think that this
> is bad from a QoS perspective (not the least implied from a warning
> issued by Citrix's multi-page-ring patches:
>
> 	if (blkif_reqs < BLK_RING_SIZE(order))
> 		printk(KERN_WARNING "WARNING: "
> 		       "I/O request space (%d reqs) < ring order %ld, "
> 		       "consider increasing %s.reqs to >= %ld.",
> 		       blkif_reqs, order, KBUILD_MODNAME,
> 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>
> indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> this to per-instance allocations, as the amount of memory taken
> away from Dom0 for this may not be insignificant when there are
> many devices.
>
> Does anyone have an opinion here, in particular regarding the
> original authors' decision to make this global vs. the observation
> apparently made by Daniel Stodden (the author of said patch, for
> whom I don't have a current email address to ask directly), but also
> in the context of multi-page rings, the purpose of which is to
> allow for larger amounts of in-flight I/O?

Not really much to say other than we (well, mostly Wei Liu) have a
similar issue with netback too. That used to have a global pool, then a
pool-per-worker thread. When Wei added thread-per-vif support he solved
this by adding a "page pool" which handles allocations. Possibly this
could grow some sort of fairness etc. knobs and be shared with blkback?

Ian.
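What such a shared pool with fairness knobs might look like, as a minimal
sketch - all names here are hypothetical, and Wei's actual page-pool
patches may well differ:

	#include <linux/mm.h>
	#include <linux/spinlock.h>

	#define POOL_MAX_CONSUMERS 64	/* arbitrary for the sketch */

	struct page_pool {
		spinlock_t   lock;
		unsigned int max_pages;		/* overall cap */
		unsigned int used;
		unsigned int per_consumer[POOL_MAX_CONSUMERS];
	};

	/*
	 * A crude fairness knob: no consumer (vif or vbd) may hold more
	 * than an equal share of the pool. Returns NULL when the caller
	 * is over budget and must wait.
	 */
	static struct page *page_pool_get(struct page_pool *pool, int id)
	{
		unsigned int share = pool->max_pages / POOL_MAX_CONSUMERS;
		struct page *page;

		spin_lock(&pool->lock);
		if (pool->used >= pool->max_pages ||
		    pool->per_consumer[id] >= share) {
			spin_unlock(&pool->lock);
			return NULL;
		}
		pool->used++;
		pool->per_consumer[id]++;
		spin_unlock(&pool->lock);

		page = alloc_page(GFP_KERNEL);	/* may sleep; outside the lock */
		if (!page) {
			spin_lock(&pool->lock);
			pool->used--;
			pool->per_consumer[id]--;
			spin_unlock(&pool->lock);
		}
		return page;
	}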
On Mon, Mar 26, 2012 at 05:20:08PM +0100, Ian Campbell wrote:
> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > All the resources allocated based on xen_blkif_reqs are global in
> > blkback. While (without having measured anything) I think that this
> > is bad from a QoS perspective (not the least implied from a warning
> > issued by Citrix's multi-page-ring patches:
> >
> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> > 		printk(KERN_WARNING "WARNING: "
> > 		       "I/O request space (%d reqs) < ring order %ld, "
> > 		       "consider increasing %s.reqs to >= %ld.",
> > 		       blkif_reqs, order, KBUILD_MODNAME,
> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >
> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> > this to per-instance allocations, as the amount of memory taken
> > away from Dom0 for this may not be insignificant when there are
> > many devices.
> >
> > Does anyone have an opinion here, in particular regarding the
> > original authors' decision to make this global vs. the observation
> > apparently made by Daniel Stodden (the author of said patch, for
> > whom I don't have a current email address to ask directly), but also
> > in the context of multi-page rings, the purpose of which is to
> > allow for larger amounts of in-flight I/O?
>
> Not really much to say other than we (well, mostly Wei Liu) have a
> similar issue with netback too. That used to have a global pool, then a
> pool-per-worker thread. When Wei added thread-per-vif support he solved
> this by adding a "page pool" which handles allocations. Possibly this
> could grow some sort of fairness etc. knobs and be shared with blkback?

Yes! That is what I was hoping for - as it would hopefully also have
the shrinker API hooked up, to combat excessive memory consumption.

> Ian.
On Mon, Mar 26, 2012 at 09:20:08AM -0700, Ian Campbell wrote:
> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> >
> > Does anyone have an opinion here, in particular regarding the
> > original authors' decision to make this global vs. the observation
> > apparently made by Daniel Stodden (the author of said patch, for
> > whom I don't have a current email address to ask directly), but also
> > in the context of multi-page rings, the purpose of which is to
> > allow for larger amounts of in-flight I/O?
>
> Not really much to say other than we (well, mostly Wei Liu) have a
> similar issue with netback too. That used to have a global pool, then a
> pool-per-worker thread. When Wei added thread-per-vif support he solved
> this by adding a "page pool" which handles allocations. Possibly this
> could grow some sort of fairness etc. knobs and be shared with blkback?

I'd definitely welcome an approach in blkback similar to the page pool
changes to netback. Steve Noonan's change to use a spinlock per
blkfront device [1] should show increased contention on the global
request list on the blkback side when multiple devices are attached.

Does netback not also need to handle questions of fairness and
resource starvation?

Matt

[1] http://git.kernel.org/?p=linux/kernel/git/axboe/linux-block.git;a=commitdiff;h=3467811e26660eb46bc655234573d22d6876d5f9
>>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
> On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
>> Cc'ing Daniel for you on this one, Jan.
>>
>> K.
>>
>> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>> > All the resources allocated based on xen_blkif_reqs are global in
>> > blkback. While (without having measured anything) I think that this
>> > is bad from a QoS perspective (not the least implied from a warning
>> > issued by Citrix's multi-page-ring patches:
>> >
>> > 	if (blkif_reqs < BLK_RING_SIZE(order))
>> > 		printk(KERN_WARNING "WARNING: "
>> > 		       "I/O request space (%d reqs) < ring order %ld, "
>> > 		       "consider increasing %s.reqs to >= %ld.",
>> > 		       blkif_reqs, order, KBUILD_MODNAME,
>> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>> >
>> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
>> > this to per-instance allocations, as the amount of memory taken
>> > away from Dom0 for this may not be insignificant when there are
>> > many devices.
>> >
>> > Does anyone have an opinion here, in particular regarding the
>> > original authors' decision to make this global vs. the observation
>> > apparently made by Daniel Stodden (the author of said patch, for
>> > whom I don't have a current email address to ask directly), but also
>> > in the context of multi-page rings, the purpose of which is to
>> > allow for larger amounts of in-flight I/O?
>> >
>> > Thanks, Jan
>
> Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> Citrix regarding that matter.
>
> Yes, just allocating a pfn pool per backend instance is way too much
> memory ballooned out. Otherwise this stuff would have never looked the
> way it does now.

This of course could be accounted for by having an initially non-empty
(large enough) balloon (not sure how easy it is these days to do this
for pv-ops, but it has always been trivial with the legacy code). That
wouldn't help a 32-bit kernel much (where generally the initial balloon
is all in highmem, yet the vacated pages need to be in lowmem), but
for 64-bit kernels it should be fine.

> Regarding the right balance, note that on the other extreme end, if PFN
> space were infinite, there's not much expected performance gain from
> rendering virtual backends fully independent. Beyond controller queue
> depth, these requests are all just going to pile up, waiting.

Is there a way to look through the queue stack to find out how many
distinct ones there are that the backend is running on top of, as well
as - for a particular I/O path - the one with the smallest depth? Or can
one assume that the top most one (generally loop's or blktap2's) won't
advertise a queue deeper than what is going to be accepted
downstream (probably not, I'd guess)?

And - what you say would similarly apply to the usefulness of multi-page
rings afaict.

> XenServer has some support for decoupling in blktap.ko [1] which worked
> relatively well: Use frame 'pool' kobjects. A bunch of pages, mapped to
> a sysfs object. Name was arbitrary. Size configurable, even at runtime.
>
> Sysfs meant stuff was easily set up by shell or python code, or
> manually. To become operational, every backend must be bound to a pool
> (initially, the global 'default' one, for tool compat). Backends can be
> relinked arbitrarily before entering the Connected state.
>
> Then let the userland toolstack set things up according to physical I/O
> topology and properties probed. Basically every physical backend (say, a
> volume group, or an HBA) would start out by allocating and dimensioning a
> dedicated pool (named after the backend), and every backend instance
> fired up gets bound to the pool it belongs to.

Having userland do all that seems like a fallback solution only to me - I
would hope that sufficient information is available directly to the drivers.

Thanks in any case for responding so quickly,
Jan

> There's a lot of additional optimizations one could consider, e.g.
> autogrowing the pool (log(nbackends) or so?) and the like. To improve
> locality, having backends which look ahead in their request queue and
> allocate whole batches is probably a good idea too, etc, etc.
>
> HTH,
> Daniel
>
> [1]
> http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git
> mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> Note that these are based on mempools, not the frame pools blkback
> would take.
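The mempool variant mentioned in Daniel's footnote would, in outline, be
along these lines - a sketch against the stock kernel mempool API, with
the pool sizing and the reuse of struct pending_req purely illustrative.
A mempool guarantees a minimum reserve per pool while letting allocations
beyond that fall through to the slab allocator, which is what makes
per-pool dimensioning cheap compared to ballooning out a fixed pfn range:

	#include <linux/mempool.h>
	#include <linux/slab.h>

	static struct kmem_cache *request_cache;
	static mempool_t *request_pool;

	/* One such pool per physical backend; min_reqs is its reserve. */
	static int pool_init(unsigned int min_reqs)
	{
		request_cache = kmem_cache_create("blk_request",
						  sizeof(struct pending_req),
						  0, 0, NULL);
		if (!request_cache)
			return -ENOMEM;

		request_pool = mempool_create_slab_pool(min_reqs,
							request_cache);
		if (!request_pool) {
			kmem_cache_destroy(request_cache);
			return -ENOMEM;
		}
		return 0;
	}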
On Mon, 2012-03-26 at 21:47 +0100, Matt Wilson wrote:
> On Mon, Mar 26, 2012 at 09:20:08AM -0700, Ian Campbell wrote:
> > On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > >
> > > Does anyone have an opinion here, in particular regarding the
> > > original authors' decision to make this global vs. the observation
> > > apparently made by Daniel Stodden (the author of said patch, for
> > > whom I don't have a current email address to ask directly), but also
> > > in the context of multi-page rings, the purpose of which is to
> > > allow for larger amounts of in-flight I/O?
> >
> > Not really much to say other than we (well, mostly Wei Liu) have a
> > similar issue with netback too. That used to have a global pool, then a
> > pool-per-worker thread. When Wei added thread-per-vif support he solved
> > this by adding a "page pool" which handles allocations. Possibly this
> > could grow some sort of fairness etc. knobs and be shared with blkback?
>
> I'd definitely welcome an approach in blkback similar to the page pool
> changes to netback. Steve Noonan's change to use a spinlock per
> blkfront device [1] should show increased contention on the global
> request list on the blkback side when multiple devices are attached.
>
> Does netback not also need to handle questions of fairness and
> resource starvation?

It does. What I was meaning to say was that the page pool could grow
some fairness knobs for netback's use (if it is not already fair) and,
separately, that it could be shared with blkback (rather than the
fairness knobs being tied to the sharing with blkback, which, looking
back, is how my original statement could have been read).

Ian.

> Matt
>
> [1] http://git.kernel.org/?p=linux/kernel/git/axboe/linux-block.git;a=commitdiff;h=3467811e26660eb46bc655234573d22d6876d5f9
On Tue, 2012-03-27 at 08:27 +0100, Jan Beulich wrote:
> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
> > On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
> >> Cc'ing Daniel for you on this one, Jan.
> >>
> >> K.
> >>
> >> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix's multi-page-ring patches:
> >> >
> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> > 		printk(KERN_WARNING "WARNING: "
> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> > this to per-instance allocations, as the amount of memory taken
> >> > away from Dom0 for this may not be insignificant when there are
> >> > many devices.
> >> >
> >> > Does anyone have an opinion here, in particular regarding the
> >> > original authors' decision to make this global vs. the observation
> >> > apparently made by Daniel Stodden (the author of said patch, for
> >> > whom I don't have a current email address to ask directly), but also
> >> > in the context of multi-page rings, the purpose of which is to
> >> > allow for larger amounts of in-flight I/O?
> >> >
> >> > Thanks, Jan
> >
> > Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> > Citrix regarding that matter.
> >
> > Yes, just allocating a pfn pool per backend instance is way too much
> > memory ballooned out. Otherwise this stuff would have never looked the
> > way it does now.
>
> This of course could be accounted for by having an initially non-empty
> (large enough) balloon (not sure how easy it is these days to do this
> for pv-ops, but it has always been trivial with the legacy code). That
> wouldn't help a 32-bit kernel much (where generally the initial balloon
> is all in highmem, yet the vacated pages need to be in lowmem), but
> for 64-bit kernels it should be fine.
>
> > Regarding the right balance, note that on the other extreme end, if PFN
> > space were infinite, there's not much expected performance gain from
> > rendering virtual backends fully independent. Beyond controller queue
> > depth, these requests are all just going to pile up, waiting.
>
> Is there a way to look through the queue stack to find out how many
> distinct ones there are that the backend is running on top of, as well
> as - for a particular I/O path - the one with the smallest depth? Or can
> one assume that the top most one (generally loop's or blktap2's) won't
> advertise a queue deeper than what is going to be accepted
> downstream (probably not, I'd guess)?
>
> And - what you say would similarly apply to the usefulness of multi-page
> rings afaict.
>

The balance is tricky. What I observe so far is that having multi-page
rings doesn't necessarily help improve performance (though it is still
nice to have for future use). There are other contention points which
limit the throughput of a single VIF.

> > XenServer has some support for decoupling in blktap.ko [1] which worked
> > relatively well: Use frame 'pool' kobjects. A bunch of pages, mapped to
> > a sysfs object. Name was arbitrary. Size configurable, even at runtime.
> >
> > Sysfs meant stuff was easily set up by shell or python code, or
> > manually. To become operational, every backend must be bound to a pool
> > (initially, the global 'default' one, for tool compat). Backends can be
> > relinked arbitrarily before entering the Connected state.
> >
> > Then let the userland toolstack set things up according to physical I/O
> > topology and properties probed. Basically every physical backend (say, a
> > volume group, or an HBA) would start out by allocating and dimensioning a
> > dedicated pool (named after the backend), and every backend instance
> > fired up gets bound to the pool it belongs to.
>
> Having userland do all that seems like a fallback solution only to me - I
> would hope that sufficient information is available directly to the drivers.
>

I'm tempted to make all information available to drivers, but haven't
reached a conclusion yet. Maybe we should also allow users to
experiment with various configurations for their specific needs?

Wei.

> Thanks in any case for responding so quickly,
> Jan
>
> > There's a lot of additional optimizations one could consider, e.g.
> > autogrowing the pool (log(nbackends) or so?) and the like. To improve
> > locality, having backends which look ahead in their request queue and
> > allocate whole batches is probably a good idea too, etc, etc.
> >
> > HTH,
> > Daniel
> >
> > [1]
> > http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git
> > mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> > Note that these are based on mempools, not the frame pools blkback
> > would take.
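Leaving room for user experimentation is cheap to provide; a sketch with
hypothetical parameter names (not an existing interface in either driver):

	#include <linux/module.h>

	static unsigned int pool_pages = 256;
	module_param(pool_pages, uint, 0644);	/* adjustable at runtime */
	MODULE_PARM_DESC(pool_pages, "pages reserved per backend pool");

	static unsigned int max_ring_order;
	module_param(max_ring_order, uint, 0444);
	MODULE_PARM_DESC(max_ring_order, "maximum ring page order to accept");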
On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > All the resources allocated based on xen_blkif_reqs are global in
> > blkback. While (without having measured anything) I think that this
> > is bad from a QoS perspective (not the least implied from a warning
> > issued by Citrix's multi-page-ring patches:
> >
> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> > 		printk(KERN_WARNING "WARNING: "
> > 		       "I/O request space (%d reqs) < ring order %ld, "
> > 		       "consider increasing %s.reqs to >= %ld.",
> > 		       blkif_reqs, order, KBUILD_MODNAME,
> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >
> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> > this to per-instance allocations, as the amount of memory taken
> > away from Dom0 for this may not be insignificant when there are
> > many devices.
> >

What's your main concern on QoS? Lock contention? Starvation? Or
something else?

You are right that we should be careful about how much memory blkback
may consume.

> > Does anyone have an opinion here, in particular regarding the
> > original authors' decision to make this global vs. the observation
> > apparently made by Daniel Stodden (the author of said patch, for
> > whom I don't have a current email address to ask directly), but also
> > in the context of multi-page rings, the purpose of which is to
> > allow for larger amounts of in-flight I/O?
>

I don't know the actual bottleneck for blkback because I have never
played with it. But for multi-page-ring netback / netfront, the global
pool (in the sense of starvation) may not become a bottleneck if there
are only a few VIFs running. What I observe is that to achieve 3.8Gb/s
throughput between two VIFs connected through an internal bridge, only
~80 pages of the pool are consumed.

Wei.

> Not really much to say other than we (well, mostly Wei Liu) have a
> similar issue with netback too. That used to have a global pool, then a
> pool-per-worker thread. When Wei added thread-per-vif support he solved
> this by adding a "page pool" which handles allocations. Possibly this
> could grow some sort of fairness etc. knobs and be shared with blkback?
>
> Ian.
>>> On 27.03.12 at 11:21, Wei Liu <wei.liu2@citrix.com> wrote:
> On Tue, 2012-03-27 at 08:27 +0100, Jan Beulich wrote:
>> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
>> > Then let the userland toolstack set things up according to physical I/O
>> > topology and properties probed. Basically every physical backend (say, a
>> > volume group, or an HBA) would start out by allocating and dimensioning a
>> > dedicated pool (named after the backend), and every backend instance
>> > fired up gets bound to the pool it belongs to.
>>
>> Having userland do all that seems like a fallback solution only to me - I
>> would hope that sufficient information is available directly to the drivers.
>>
>
> I'm tempted to make all information available to drivers, but haven't
> reached a conclusion yet. Maybe we should also allow users to
> experiment with various configurations for their specific needs?

Oh, yes, user overrides are certainly desirable. I was merely hinting
at it being desirable from my pov to have sensible default behavior.

Jan
>>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
> On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
>> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
>> > All the resources allocated based on xen_blkif_reqs are global in
>> > blkback. While (without having measured anything) I think that this
>> > is bad from a QoS perspective (not the least implied from a warning
>> > issued by Citrix's multi-page-ring patches:
>> >
>> > 	if (blkif_reqs < BLK_RING_SIZE(order))
>> > 		printk(KERN_WARNING "WARNING: "
>> > 		       "I/O request space (%d reqs) < ring order %ld, "
>> > 		       "consider increasing %s.reqs to >= %ld.",
>> > 		       blkif_reqs, order, KBUILD_MODNAME,
>> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>> >
>> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
>> > this to per-instance allocations, as the amount of memory taken
>> > away from Dom0 for this may not be insignificant when there are
>> > many devices.
>> >
>
> What's your main concern on QoS? Lock contention? Starvation? Or
> something else?

However you want to put it. Prior to the multi-page ring patches, we
have 64 pending requests (global) and 32 ring entries. Obviously,
bumping the ring size just to order 1 will already bring the number of
possible in-flight entries per device on par with those in-flight across
all devices. So _if_ someone really determined that a multi-page ring
helps performance, I wonder whether that was with manually
adjusted global pending request values (not said anywhere) or with
just a single frontend (not very close to real world scenarios).

In any case, two guests with heavy I/O clearly have the potential to
hinder each other, even if both get backed by different physical
devices.

Jan
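For reference, the arithmetic behind those numbers follows from the
standard ring macros; the BLK_RING_SIZE definition below is roughly how
the multi-page patches spell it:

	#include <linux/mm.h>
	#include <xen/interface/io/ring.h>
	#include <xen/interface/io/blkif.h>

	/*
	 * Entries in a blkif ring spanning (1 << order) pages:
	 * __CONST_RING_SIZE() rounds the usable space (ring pages minus
	 * the shared-ring header) down to a power-of-two entry count.
	 */
	#define BLK_RING_SIZE(order) \
		__CONST_RING_SIZE(blkif, PAGE_SIZE << (order))

With 4K pages this yields 32 entries at order 0 and 64 at order 1 - i.e.
a single order-1 ring can already have as many requests in flight as the
64 pending requests shared globally by all backends.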
On Tue, 2012-03-27 at 11:22 +0100, Jan Beulich wrote:
> >>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
> >> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix's multi-page-ring patches:
> >> >
> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> > 		printk(KERN_WARNING "WARNING: "
> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> > this to per-instance allocations, as the amount of memory taken
> >> > away from Dom0 for this may not be insignificant when there are
> >> > many devices.
> >> >
> >
> > What's your main concern on QoS? Lock contention? Starvation? Or
> > something else?
>
> However you want to put it. Prior to the multi-page ring patches, we
> have 64 pending requests (global) and 32 ring entries. Obviously,
> bumping the ring size just to order 1 will already bring the number of
> possible in-flight entries per device on par with those in-flight across
> all devices. So _if_ someone really determined that a multi-page ring
> helps performance, I wonder whether that was with manually
> adjusted global pending request values (not said anywhere) or with
> just a single frontend (not very close to real world scenarios).
>

Just to be precise, bumping the order to 1 makes the number of ring
entries more than 64.

> In any case, two guests with heavy I/O clearly have the potential to
> hinder each other, even if both get backed by different physical
> devices.
>

Right. One solution I can think of is that each blk thread holds a
small number of private entries (threshold to be determined), while
blkback maintains a pool for any allocation that goes beyond that
threshold. But this just makes things more and more complex - let's not
over-engineer before we identify the real bottleneck.

Wei.

> Jan
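Wei's two-tier idea, sketched with hypothetical names and an arbitrary
reserve size - the private reserve avoids the shared lock in the common
case, falling back to the global pool (the alloc_req() of the earlier
sketch) only under pressure:

	#include <linux/list.h>

	#define PRIVATE_RESERVE 8	/* the "threshold to be determined" */

	struct blk_thread {
		struct list_head private_free;	/* this thread's reserve */
		unsigned int	 private_count;
	};

	static struct pending_req *two_tier_alloc(struct blk_thread *t)
	{
		struct pending_req *req;

		if (!list_empty(&t->private_free)) {
			req = list_first_entry(&t->private_free,
					       struct pending_req, free_list);
			list_del(&req->free_list);
			t->private_count--;
			return req;
		}
		return alloc_req();	/* contended shared pool */
	}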
>>> On 27.03.12 at 14:34, Wei Liu <wei.liu2@citrix.com> wrote:
> On Tue, 2012-03-27 at 11:22 +0100, Jan Beulich wrote:
>> >>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
>> > On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
>> >> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
>> >> > All the resources allocated based on xen_blkif_reqs are global in
>> >> > blkback. While (without having measured anything) I think that this
>> >> > is bad from a QoS perspective (not the least implied from a warning
>> >> > issued by Citrix's multi-page-ring patches:
>> >> >
>> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
>> >> > 		printk(KERN_WARNING "WARNING: "
>> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
>> >> > 		       "consider increasing %s.reqs to >= %ld.",
>> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
>> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>> >> >
>> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
>> >> > this to per-instance allocations, as the amount of memory taken
>> >> > away from Dom0 for this may not be insignificant when there are
>> >> > many devices.
>> >> >
>> >
>> > What's your main concern on QoS? Lock contention? Starvation? Or
>> > something else?
>>
>> However you want to put it. Prior to the multi-page ring patches, we
>> have 64 pending requests (global) and 32 ring entries. Obviously,
>> bumping the ring size just to order 1 will already bring the number of
>> possible in-flight entries per device on par with those in-flight across
>> all devices. So _if_ someone really determined that a multi-page ring
>> helps performance, I wonder whether that was with manually
>> adjusted global pending request values (not said anywhere) or with
>> just a single frontend (not very close to real world scenarios).
>>
>
> Just to be precise, bumping the order to 1 makes the number of ring
> entries more than 64.

Iirc order 0 -> 32 entries, order 1 -> 64 entries.

Jan
On Tue, 2012-03-27 at 13:45 +0100, Jan Beulich wrote:
> >>> On 27.03.12 at 14:34, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Tue, 2012-03-27 at 11:22 +0100, Jan Beulich wrote:
> >> >>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
> >> > On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
> >> >> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> >> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> >> > blkback. While (without having measured anything) I think that this
> >> >> > is bad from a QoS perspective (not the least implied from a warning
> >> >> > issued by Citrix's multi-page-ring patches:
> >> >> >
> >> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> >> > 		printk(KERN_WARNING "WARNING: "
> >> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >> >
> >> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> >> > this to per-instance allocations, as the amount of memory taken
> >> >> > away from Dom0 for this may not be insignificant when there are
> >> >> > many devices.
> >> >> >
> >> >
> >> > What's your main concern on QoS? Lock contention? Starvation? Or
> >> > something else?
> >>
> >> However you want to put it. Prior to the multi-page ring patches, we
> >> have 64 pending requests (global) and 32 ring entries. Obviously,
> >> bumping the ring size just to order 1 will already bring the number of
> >> possible in-flight entries per device on par with those in-flight across
> >> all devices. So _if_ someone really determined that a multi-page ring
> >> helps performance, I wonder whether that was with manually
> >> adjusted global pending request values (not said anywhere) or with
> >> just a single frontend (not very close to real world scenarios).
> >>
> >
> > Just to be precise, bumping the order to 1 makes the number of ring
> > entries more than 64.
>
> Iirc order 0 -> 32 entries, order 1 -> 64 entries.
>

I skimmed the code just to make sure. You're right, the size is rounded
down to a power of 2. What I was talking about is the value before
rounding down.

Wei.

> Jan
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, March 27, 2012 8:27 AM
> To: Daniel Stodden
> Cc: Andrei Lifchits; xen-devel
> Subject: Re: [Xen-devel] blkback global resources
>
> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com>
> wrote:
> > On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
> >> Cc'ing Daniel for you on this one, Jan.
> >>
> >> K.
> >>
> >> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix's multi-page-ring patches:
> >> >
> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> > 		printk(KERN_WARNING "WARNING: "
> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to
> >> > convert this to per-instance allocations, as the amount of memory
> >> > taken away from Dom0 for this may not be insignificant when there
> >> > are many devices.
> >> >
> >> > Does anyone have an opinion here, in particular regarding the
> >> > original authors' decision to make this global vs. the observation
> >> > apparently made by Daniel Stodden (the author of said patch, for
> >> > whom I don't have a current email address to ask directly), but also
> >> > in the context of multi-page rings, the purpose of which is to
> >> > allow for larger amounts of in-flight I/O?
> >> >
> >> > Thanks, Jan
> >
> > Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> > Citrix regarding that matter.
> >
> > Yes, just allocating a pfn pool per backend instance is way too much
> > memory ballooned out. Otherwise this stuff would have never looked the
> > way it does now.
>
> This of course could be accounted for by having an initially non-empty (large
> enough) balloon (not sure how easy it is these days to do this for pv-ops, but
> it has always been trivial with the legacy code). That wouldn't help a 32-bit
> kernel much (where generally the initial balloon is all in highmem, yet the
> vacated pages need to be in lowmem), but for 64-bit kernels it should be
> fine.
>
> > Regarding the right balance, note that on the other extreme end, if
> > PFN space were infinite, there's not much expected performance gain
> > from rendering virtual backends fully independent. Beyond controller
> > queue depth, these requests are all just going to pile up, waiting.
>
> Is there a way to look through the queue stack to find out how many distinct
> ones there are that the backend is running on top of, as well as - for a
> particular I/O path - the one with the smallest depth? Or can one assume
> that the top most one (generally loop's or blktap2's) won't advertise a queue
> deeper than what is going to be accepted downstream (probably not, I'd
> guess)?

Hm, I don't remember seeing anything relating to that off the top of my
head in the blkback code, so I don't think so. (I'm not sure the benefit
would be that great, anyways.)

> And - what you say would similarly apply to the usefulness of multi-page
> rings afaict.
>
> > XenServer has some support for decoupling in blktap.ko [1] which
> > worked relatively well: Use frame 'pool' kobjects. A bunch of pages,
> > mapped to a sysfs object. Name was arbitrary. Size configurable, even at
> > runtime.

I have added similar functionality to blkback (pools configurable
through xenstore, with userland tools creating one pool per SR), which
is now out in the form of a limited-availability hotfix and will be
there in the next XenServer release. Felipe (CC'd) measured the effects
on performance and found that it helps.

> > Sysfs meant stuff was easily set up by shell or python code, or
> > manually. To become operational, every backend must be bound to a pool
> > (initially, the global 'default' one, for tool compat). Backends can
> > be relinked arbitrarily before entering the Connected state.
> >
> > Then let the userland toolstack set things up according to physical
> > I/O topology and properties probed. Basically every physical backend
> > (say, a volume group, or an HBA) would start out by allocating and
> > dimensioning a dedicated pool (named after the backend), and every
> > backend instance fired up gets bound to the pool it belongs to.
>
> Having userland do all that seems like a fallback solution only to me - I would
> hope that sufficient information is available directly to the drivers.

You're probably right.

> Thanks in any case for responding so quickly, Jan
>
> > There's a lot of additional optimizations one could consider, e.g.
> > autogrowing the pool (log(nbackends) or so?) and the like. To improve
> > locality, having backends which look ahead in their request queue and
> > allocate whole batches is probably a good idea too, etc, etc.
> >
> > HTH,
> > Daniel
> >
> > [1]
> > http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git
> > mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> > Note that these are based on mempools, not the frame pools blkback
> > would take.

Cheers,
Andrei
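A backend picking up such a pool binding might read it from xenstore
along these lines; note that the "pool" node name is an assumption here -
the key actually used by the hotfix isn't stated in this thread:

	#include <linux/err.h>
	#include <linux/slab.h>
	#include <xen/xenbus.h>

	/* Sketch: bind a backend instance to the pool named by the
	 * toolstack in this device's xenstore directory. */
	static int bind_to_pool(struct xenbus_device *dev)
	{
		char *pool_name;

		pool_name = xenbus_read(XBT_NIL, dev->nodename, "pool", NULL);
		if (IS_ERR(pool_name))
			return PTR_ERR(pool_name); /* or fall back to "default" */

		/* ... look up / create the pool, attach this backend ... */
		kfree(pool_name);
		return 0;
	}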