Dave Hansen
2016-Dec-07 16:57 UTC
[PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
Removing silly virtio-dev@ list because it's bouncing mail...

On 12/07/2016 08:21 AM, David Hildenbrand wrote:
>> Li's current patches do that. Well, maybe not pfn/length, but they do
>> take a pfn and page-order, which fits perfectly with the kernel's
>> concept of high-order pages.
>
> So we can send length in powers of two. Still, I don't see any benefit
> over a simple pfn/len schema. But I'll have a more detailed look at the
> implementation first, maybe that will enlighten me :)

It is more space-efficient. We're fitting the order into 6 bits, which would allow the full 2^64 address space to be represented in one entry, and leaves room for the bitmap size to be encoded as well, if we decide we need a bitmap in the future.

If that was purely a length, we'd be limited to 64*4k pages per entry, which isn't even a full large page.
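As an illustration of the space argument in the message above, here is a minimal sketch of a 64-bit pfn/order entry. The field layout (52-bit pfn, 6 reserved bits for a future bitmap size, 6-bit order) and the names `pack_entry`, `unpack_entry` and `entry_bytes` are assumptions made for this example; they are not taken from Li's patches.

```python
# Hypothetical 64-bit entry layout (an assumption, not the patch's ABI):
#   bits 63..12  pfn (52 bits, enough for a 2^64 address space with 4k pages)
#   bits 11..6   reserved for a possible bitmap-size field
#   bits  5..0   page order
PAGE_SHIFT = 12                       # 4 KiB base pages
ORDER_BITS = 6
BITMAP_SIZE_BITS = 6
PFN_SHIFT = ORDER_BITS + BITMAP_SIZE_BITS
ORDER_MASK = (1 << ORDER_BITS) - 1


def pack_entry(pfn, order):
    """Pack a page frame number and a page order into one 64-bit value."""
    if not 0 <= order <= ORDER_MASK:
        raise ValueError("order does not fit in 6 bits")
    if pfn < 0 or pfn >> (64 - PFN_SHIFT):
        raise ValueError("pfn does not fit in 52 bits")
    return (pfn << PFN_SHIFT) | order


def unpack_entry(entry):
    """Return (pfn, order) from a packed entry."""
    return entry >> PFN_SHIFT, entry & ORDER_MASK


def entry_bytes(entry):
    """Number of bytes described by one entry: 2^order pages."""
    _, order = unpack_entry(entry)
    return (1 << order) << PAGE_SHIFT


# If the same 6 bits held a page count instead of an order, one entry
# could describe at most 64 pages of 4 KiB, i.e. 256 KiB.
LEN_FIELD_MAX_BYTES = (1 << ORDER_BITS) << PAGE_SHIFT
LARGE_PAGE_BYTES = 2 << 20            # 2 MiB large page
```

With this layout, `pack_entry(0x200, 9)` describes one 2 MiB large page, and an order of 52 at pfn 0 covers the whole 2^64 byte address space, while `LEN_FIELD_MAX_BYTES` stays below `LARGE_PAGE_BYTES`.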
Andrea Arcangeli
2016-Dec-07 18:38 UTC
[Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
Hello,

On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote:
> It is more space-efficient. We're fitting the order into 6 bits, which
> would allow the full 2^64 address space to be represented in one entry,

Very large order is the same as very large len, 6 bits of order or 8 bytes of len won't really move the needle here, simpler code is preferable.

The main benefit of "len" is that it can be more granular, plus it's simpler than the bitmap too. Eventually all this stuff has to end up into a madvisev (not yet upstream but somebody posted it for jemalloc and should get merged eventually).

So the bitmap shall be demuxed to an addr,len array anyway, the bitmap won't ever be sent to the madvise syscall, which makes the intermediate representation with the bitmap a complication with basically no benefits compared to a (N, [addr1,len1], ..., [addrN, lenN]) representation.

If you prefer 1 byte of order (not just 6 bits) instead of 8 bytes of len that's possible too, I wouldn't be against that, the conversion before calling madvise would be pretty efficient too.

> and leaves room for the bitmap size to be encoded as well, if we decide
> we need a bitmap in the future.

How would a bitmap ever be useful with very large page-order?

> If that was purely a length, we'd be limited to 64*4k pages per entry,
> which isn't even a full large page.

I don't follow here.

What we suggest is to send the data down represented as (N, [addr1,len1], ..., [addrN, lenN]) which allows infinite ranges each one of maximum length 2^64, so 2^64 multiplied infinite times if you wish. Simplifying the code and not having any bitmap at all and no :6 :6 bits either.

The high order to low order loop of allocations is the interesting part here, not the bitmap, and the fact of doing a single vmexit to send the large ranges.
Once we pull out the largest order regions, we just add them to the array as [addr,1UL<<order]. When the array reaches a maximum of N entries or we fail an order 0 allocation, we flush all those entries down to qemu. Qemu then builds the iov for madvisev and it's pretty much a 1:1 conversion, not a decoding operation converting the bitmap into the (N, [addr1,len1], ..., [addrN, lenN]) for madvisev (or a flood of madvise MADV_DONTNEED with current kernels).

Considering the loop that allocates starting from MAX_ORDER..1, the chance the bitmap is actually getting filled with more than one bit at page_shift of PAGE_SHIFT should be very low after some uptime.

By the very nature of this loop, if we have already exhausted all high order buddies, the page-order 0 pages obtained are going to be fairly fragmented, reducing the usefulness of the bitmap and potentially only wasting CPU/memory.
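The high-order to low-order loop described in the message above can be sketched roughly as follows. This is a toy model under stated assumptions, not the guest driver: `ToyFreeLists` is a hypothetical stand-in for the kernel page allocator (it never splits blocks), and `inflate`, `max_entries` and the `flush` callback are names invented for the example. `flush` plays the role of the single notification that hands a batch of [addr, len] pairs to the hypervisor.

```python
PAGE_SHIFT = 12  # 4 KiB base pages


class ToyFreeLists:
    """Hypothetical stand-in for the guest allocator: free blocks per order."""

    def __init__(self, free):
        # free maps order -> list of starting pfns of free blocks
        self.free = {order: list(pfns) for order, pfns in free.items()}

    def alloc(self, order):
        """Return the pfn of a free block of this order, or None on failure."""
        blocks = self.free.get(order)
        return blocks.pop(0) if blocks else None


def inflate(allocator, max_order, max_entries, flush):
    """Allocate from the highest order down to order 0, batching ranges.

    Each block is recorded as (addr, 1 << order pages in bytes). A batch is
    flushed when it holds max_entries ranges, and once more after the
    order-0 allocation finally fails.
    """
    ranges = []
    for order in range(max_order, -1, -1):
        while True:
            pfn = allocator.alloc(order)
            if pfn is None:
                break                      # this order is exhausted
            ranges.append((pfn << PAGE_SHIFT, 1 << (order + PAGE_SHIFT)))
            if len(ranges) == max_entries:
                flush(ranges)              # one exit for the whole batch
                ranges = []
    if ranges:
        flush(ranges)


# Usage: two free 2 MiB blocks and three scattered 4 KiB pages.
batches = []
toy = ToyFreeLists({9: [0x200, 0x800], 0: [0x5, 0x9, 0x33]})
inflate(toy, max_order=10, max_entries=4, flush=batches.append)
```

On the host side each flushed batch is already a list of (addr, len) pairs, so turning it into one madvise call per range (or into an iov for a vectored variant, if one becomes available) needs no decoding step.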
Dave Hansen
2016-Dec-07 18:44 UTC
[Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
>> and leaves room for the bitmap size to be encoded as well, if we decide
>> we need a bitmap in the future.
>
> How would a bitmap ever be useful with very large page-order?

Please, guys. Read the patches. *Please*. The current code doesn't even _use_ a bitmap.
Dave Hansen
2016-Dec-07 19:54 UTC
[Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
We're talking about a bunch of different stuff which is all being conflated. There are 3 issues here that I can see. I'll attempt to summarize what I think is going on:

1. Current patches do a hypercall for each order in the allocator. This is inefficient, but independent from the underlying data structure in the ABI, unless bitmaps are in play, which they aren't.
2. Should we have bitmaps in the ABI, even if they are not in use by the guest implementation today? Andrea says they have zero benefits over a pfn/len scheme. Dave doesn't think they have zero benefits but isn't that attached to them. QEMU's handling gets more complicated when using a bitmap.
3. Should the ABI contain records each with a pfn/len pair or a pfn/order pair?
   3a. 'len' is more flexible, but will always be a power-of-two anyway for high-order pages (the common case)
   3b. if we decide not to have a bitmap, then we basically have plenty of space for 'len' and should just do it
   3c. It's easiest for the hypervisor to turn pfn/len into the madvise() calls that it needs.

Did I miss anything?

On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
> On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote:
>> It is more space-efficient. We're fitting the order into 6 bits, which
>> would allow the full 2^64 address space to be represented in one entry,
>
> Very large order is the same as very large len, 6 bits of order or 8
> bytes of len won't really move the needle here, simpler code is
> preferable.

Agreed. But without seeing them side-by-side I'm not sure we can say which is simpler.

> The main benefit of "len" is that it can be more granular, plus it's
> simpler than the bitmap too. Eventually all this stuff has to end up
> into a madvisev (not yet upstream but somebody posted it for jemalloc
> and should get merged eventually).
>
> So the bitmap shall be demuxed to an addr,len array anyway, the bitmap
> won't ever be sent to the madvise syscall, which makes the
> intermediate representation with the bitmap a complication with
> basically no benefits compared to a (N, [addr1,len1], ..., [addrN,
> lenN]) representation.

FWIW, I don't feel that strongly about the bitmap. Li had one originally, but I think the code thus far has demonstrated a huge benefit without even having a bitmap.

I've got no objections to ripping the bitmap out of the ABI.

>> and leaves room for the bitmap size to be encoded as well, if we decide
>> we need a bitmap in the future.
>
> How would a bitmap ever be useful with very large page-order?

Surely we can think of a few ways...

A bitmap is 64x more dense if the lists are unordered. It means being able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G. That's 64x fewer cachelines to touch, 64x fewer pages to move to the hypervisor and lets us allocate 1/64th the memory. Given a maximum allocation that we're allowed, it lets us do 64x more per-pass.

Now, are those benefits worth it? Maybe not, but let's not pretend they don't exist. ;)

>> If that was purely a length, we'd be limited to 64*4k pages per entry,
>> which isn't even a full large page.
>
> I don't follow here.
>
> What we suggest is to send the data down represented as (N,
> [addr1,len1], ..., [addrN, lenN]) which allows infinite ranges each
> one of maximum length 2^64, so 2^64 multiplied infinite times if you
> wish. Simplifying the code and not having any bitmap at all and no :6
> :6 bits either.
>
> The high order to low order loop of allocations is the interesting part
> here, not the bitmap, and the fact of doing a single vmexit to send
> the large ranges.

Yes, the current code sends one batch of pages up to the hypervisor per order. But, this has nothing to do with the underlying data structure, or the choice to have an order vs. len in the ABI.
What you describe here is obviously more efficient.

> Considering the loop that allocates starting from MAX_ORDER..1, the
> chance the bitmap is actually getting filled with more than one bit at
> page_shift of PAGE_SHIFT should be very low after some uptime.

Yes, if bitmaps were in use, this is true. I think a guest populating bitmaps would probably not use the same algorithm.

> By the very nature of this loop, if we have already exhausted all high
> order buddies, the page-order 0 pages obtained are going to be fairly
> fragmented, reducing the usefulness of the bitmap and potentially only
> wasting CPU/memory.
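The 64x density figure quoted in this message, and the bitmap to (addr, len) conversion that the hypervisor would still need, can be checked with a short sketch. The 8-byte entry size, the LSB-first bit order and the helper name `bitmap_to_ranges` are assumptions for illustration only; nothing here is taken from the patches.

```python
PAGE_SIZE = 4096            # one 4 KiB data page shared with the hypervisor
HUGE_PAGE = 2 << 20         # 2 MiB
ENTRY_BYTES = 8             # assumed size of one array record

# One data page used as a bitmap: one bit per 2 MiB page.
bits_per_page = PAGE_SIZE * 8                     # 32768 bits (~32k)
bitmap_coverage = bits_per_page * HUGE_PAGE       # 64 GiB

# The same data page used as an array of 8-byte records.
entries_per_page = PAGE_SIZE // ENTRY_BYTES       # 512 entries
array_coverage = entries_per_page * HUGE_PAGE     # 1 GiB

density_ratio = bitmap_coverage // array_coverage  # 64


def bitmap_to_ranges(bitmap, base_addr, block_size):
    """Demux a bitmap into (addr, len) pairs, merging adjacent set bits.

    Bit i (LSB-first within each byte, an assumption) stands for the block
    starting at base_addr + i * block_size.
    """
    ranges = []
    for bit in range(len(bitmap) * 8):
        if not (bitmap[bit // 8] >> (bit % 8)) & 1:
            continue
        addr = base_addr + bit * block_size
        if ranges and ranges[-1][0] + ranges[-1][1] == addr:
            # Extend the previous range instead of starting a new one.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + block_size)
        else:
            ranges.append((addr, block_size))
    return ranges


# Bits 0, 1, 2 and 4 set: one merged 6 MiB range plus one 2 MiB range.
example = bitmap_to_ranges(bytes([0b00010111]), 0, HUGE_PAGE)
```

The arithmetic reproduces the ~64G vs. ~1G comparison, and the demux step shows the extra decoding work on the hypervisor side that a plain (addr, len) array avoids.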