Michael S. Tsirkin
2019-Mar-12 03:52 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
>
> On 2019/3/12 2:14, David Miller wrote:
> > From: "Michael S. Tsirkin" <mst at redhat.com>
> > Date: Mon, 11 Mar 2019 09:59:28 -0400
> >
> > > On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
> > > > On 2019/3/8 10:12, Christoph Hellwig wrote:
> > > > > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > address instead of copy_user() friends since they had too much
> > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > toggling. This is done through setup kernel address through vmap() and
> > > > > > register MMU notifier for invalidation.
> > > > > >
> > > > > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
> > > > > > obvious improvement.
> > > > > How is this going to work for CPUs with virtually tagged caches?
> > > > Anything different that you worry?
> > > If caches have virtual tags then kernel and userspace view of memory
> > > might not be automatically in sync if they access memory
> > > through different virtual addresses. You need to do things like
> > > flush_cache_page, probably multiple times.
> > "flush_dcache_page()"
>
> I get this. Then I think the current set_bit_to_user() is suspicious, we
> probably miss a flush_dcache_page() there:
>
>
> static int set_bit_to_user(int nr, void __user *addr)
> {
>         unsigned long log = (unsigned long)addr;
>         struct page *page;
>         void *base;
>         int bit = nr + (log % PAGE_SIZE) * 8;
>         int r;
>
>         r = get_user_pages_fast(log, 1, 1, &page);
>         if (r < 0)
>                 return r;
>         BUG_ON(r != 1);
>         base = kmap_atomic(page);
>         set_bit(bit, base);
>         kunmap_atomic(base);
>         set_page_dirty_lock(page);
>         put_page(page);
>         return 0;
> }
>
> Thanks

I think you are right. The correct fix though is to re-implement
it using asm and handling pagefault, not gup.
Three atomic ops per bit is way too expensive.

--
MST
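For reference, a minimal sketch of the -stable style fix being debated at this
point in the thread - adding the missing flush after the bit is set through
the kernel alias - might look like the following. This is only an illustration
of the point under discussion, not the patch that was eventually merged:

static int set_bit_to_user(int nr, void __user *addr)
{
	unsigned long log = (unsigned long)addr;
	struct page *page;
	void *base;
	int bit = nr + (log % PAGE_SIZE) * 8;
	int r;

	r = get_user_pages_fast(log, 1, 1, &page);
	if (r < 0)
		return r;
	BUG_ON(r != 1);
	base = kmap_atomic(page);
	set_bit(bit, base);
	kunmap_atomic(base);
	/*
	 * Illustrative addition: write back the kernel alias so that
	 * userspace, which maps this page at a different virtual
	 * address, sees the updated bit on virtually tagged cache
	 * architectures.
	 */
	flush_dcache_page(page);
	set_page_dirty_lock(page);
	put_page(page);
	return 0;
}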
Jason Wang
2019-Mar-12 07:17 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On 2019/3/12 11:52, Michael S. Tsirkin wrote:
> On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
>> On 2019/3/12 2:14, David Miller wrote:
>>> From: "Michael S. Tsirkin" <mst at redhat.com>
>>> Date: Mon, 11 Mar 2019 09:59:28 -0400
>>>
>>>> On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
>>>>> On 2019/3/8 10:12, Christoph Hellwig wrote:
>>>>>> On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
>>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>>> address instead of copy_user() friends since they had too much
>>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>>> toggling. This is done through setup kernel address through vmap() and
>>>>>>> register MMU notifier for invalidation.
>>>>>>>
>>>>>>> Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
>>>>>>> obvious improvement.
>>>>>> How is this going to work for CPUs with virtually tagged caches?
>>>>> Anything different that you worry?
>>>> If caches have virtual tags then kernel and userspace view of memory
>>>> might not be automatically in sync if they access memory
>>>> through different virtual addresses. You need to do things like
>>>> flush_cache_page, probably multiple times.
>>> "flush_dcache_page()"
>>
>> I get this. Then I think the current set_bit_to_user() is suspicious, we
>> probably miss a flush_dcache_page() there:
>>
>>
>> static int set_bit_to_user(int nr, void __user *addr)
>> {
>>         unsigned long log = (unsigned long)addr;
>>         struct page *page;
>>         void *base;
>>         int bit = nr + (log % PAGE_SIZE) * 8;
>>         int r;
>>
>>         r = get_user_pages_fast(log, 1, 1, &page);
>>         if (r < 0)
>>                 return r;
>>         BUG_ON(r != 1);
>>         base = kmap_atomic(page);
>>         set_bit(bit, base);
>>         kunmap_atomic(base);
>>         set_page_dirty_lock(page);
>>         put_page(page);
>>         return 0;
>> }
>>
>> Thanks
> I think you are right. The correct fix though is to re-implement
> it using asm and handling pagefault, not gup.

I agree, but it needs to introduce new helpers in asm for all archs, which is
not trivial. At least for -stable, we need the flush?

> Three atomic ops per bit is way too expensive.

Yes.

Thanks
Michael S. Tsirkin
2019-Mar-12 11:54 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On Tue, Mar 12, 2019 at 03:17:00PM +0800, Jason Wang wrote:
>
> On 2019/3/12 11:52, Michael S. Tsirkin wrote:
> > On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> > > On 2019/3/12 2:14, David Miller wrote:
> > > > From: "Michael S. Tsirkin" <mst at redhat.com>
> > > > Date: Mon, 11 Mar 2019 09:59:28 -0400
> > > >
> > > > > On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
> > > > > > On 2019/3/8 10:12, Christoph Hellwig wrote:
> > > > > > > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > > toggling. This is done through setup kernel address through vmap() and
> > > > > > > > register MMU notifier for invalidation.
> > > > > > > >
> > > > > > > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
> > > > > > > > obvious improvement.
> > > > > > > How is this going to work for CPUs with virtually tagged caches?
> > > > > > Anything different that you worry?
> > > > > If caches have virtual tags then kernel and userspace view of memory
> > > > > might not be automatically in sync if they access memory
> > > > > through different virtual addresses. You need to do things like
> > > > > flush_cache_page, probably multiple times.
> > > > "flush_dcache_page()"
> > >
> > > I get this. Then I think the current set_bit_to_user() is suspicious, we
> > > probably miss a flush_dcache_page() there:
> > >
> > >
> > > static int set_bit_to_user(int nr, void __user *addr)
> > > {
> > >         unsigned long log = (unsigned long)addr;
> > >         struct page *page;
> > >         void *base;
> > >         int bit = nr + (log % PAGE_SIZE) * 8;
> > >         int r;
> > >
> > >         r = get_user_pages_fast(log, 1, 1, &page);
> > >         if (r < 0)
> > >                 return r;
> > >         BUG_ON(r != 1);
> > >         base = kmap_atomic(page);
> > >         set_bit(bit, base);
> > >         kunmap_atomic(base);
> > >         set_page_dirty_lock(page);
> > >         put_page(page);
> > >         return 0;
> > > }
> > >
> > > Thanks
> > I think you are right. The correct fix though is to re-implement
> > it using asm and handling pagefault, not gup.
>
> I agree, but it needs to introduce new helpers in asm for all archs, which is
> not trivial.

We can have a generic implementation using kmap.

> At least for -stable, we need the flush?
>
> > Three atomic ops per bit is way too expensive.
>
> Yes.
>
> Thanks

See James's reply - I stand corrected, we do kunmap so no need to flush.

--
MST
Andrea Arcangeli
2019-Mar-12 20:04 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On Tue, Mar 12, 2019 at 08:46:50AM -0700, James Bottomley wrote:
> On Tue, 2019-03-12 at 07:54 -0400, Michael S. Tsirkin wrote:
> > On Tue, Mar 12, 2019 at 03:17:00PM +0800, Jason Wang wrote:
> > >
> > > On 2019/3/12 11:52, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> [...]
> > > At least for -stable, we need the flush?
> > >
> > > > Three atomic ops per bit is way too expensive.
> > >
> > > Yes.
> > >
> > > Thanks
> >
> > See James's reply - I stand corrected we do kunmap so no need to
> > flush.
>
> Well, I said that's what we do on Parisc. The cachetlb document
> definitely says if you alter the data between kmap and kunmap you are
> responsible for the flush. It's just that flush_dcache_page() is a no-
> op on x86 so they never remember to add it and since it will crash
> parisc if you get it wrong we finally gave up trying to make them.
>
> But that's the point: it is a no-op on your favourite architecture so
> it costs you nothing to add it.

Yes, the fact Parisc gave up and is doing it on kunmap is a reasonable
approach for Parisc, but it doesn't move the needle as far as vhost
common code is concerned, because other archs don't flush any cache on
kunmap. So either all other archs give up trying to optimize, or vhost
still has to call flush_dcache_page() after kunmap.

Which means after we fix vhost to add the flush_dcache_page after
kunmap, Parisc will get a double hit (but it also means Parisc was the
only one of those archs that needed explicit cache flushes, where vhost
worked correctly so far.. so it kind of proves your point that giving
up is the safe choice).

Thanks,
Andrea
Andrea Arcangeli
2019-Mar-12 21:11 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> I've got to say: optimize what?  What code do we ever have in the
> kernel that kmap's a page and then doesn't do anything with it? You can
> guarantee that on kunmap the page is either referenced (needs
> invalidating) or updated (needs flushing). The in-kernel use of kmap is
> always
>
> kmap
> do something with the mapped page
> kunmap
>
> In a very short interval.  It seems just a simplification to make
> kunmap do the flush if needed rather than try to have the users
> remember.  The thing which makes this really simple is that on most
> architectures flush and invalidate is the same operation.  If you
> really want to optimize you can use the referenced and dirty bits on
> the kmapped pte to tell you what operation to do, but if your flush is
> your invalidate, you simply assume the data needs flushing on kunmap
> without checking anything.

Except other archs like arm64 and sparc do the cache flushing on
copy_to_user_page and copy_user_page, not on kunmap.

#define copy_user_page(to,from,vaddr,pg) __cpu_copy_user_page(to, from, vaddr)

void __cpu_copy_user_page(void *kto, const void *kfrom, unsigned long vaddr)
{
	struct page *page = virt_to_page(kto);
	copy_page(kto, kfrom);
	flush_dcache_page(page);
}

#define copy_user_page(to, from, vaddr, page)	\
	do {	copy_page(to, from);		\
		sparc_flush_page_to_ram(page);	\
	} while (0)

And they do nothing on kunmap:

static inline void kunmap(struct page *page)
{
	BUG_ON(in_interrupt());
	if (!PageHighMem(page))
		return;
	kunmap_high(page);
}

void kunmap_high(struct page *page)
{
	unsigned long vaddr;
	unsigned long nr;
	unsigned long flags;
	int need_wakeup;
	unsigned int color = get_pkmap_color(page);
	wait_queue_head_t *pkmap_map_wait;

	lock_kmap_any(flags);
	vaddr = (unsigned long)page_address(page);
	BUG_ON(!vaddr);
	nr = PKMAP_NR(vaddr);

	/*
	 * A count must never go down to zero
	 * without a TLB flush!
	 */
	need_wakeup = 0;
	switch (--pkmap_count[nr]) {
	case 0:
		BUG();
	case 1:
		/*
		 * Avoid an unnecessary wake_up() function call.
		 * The common case is pkmap_count[] == 1, but
		 * no waiters.
		 * The tasks queued in the wait-queue are guarded
		 * by both the lock in the wait-queue-head and by
		 * the kmap_lock.  As the kmap_lock is held here,
		 * no need for the wait-queue-head's lock.  Simply
		 * test if the queue is empty.
		 */
		pkmap_map_wait = get_pkmap_wait_queue_head(color);
		need_wakeup = waitqueue_active(pkmap_map_wait);
	}
	unlock_kmap_any(flags);

	/* do wake-up, if needed, race-free outside of the spin lock */
	if (need_wakeup)
		wake_up(pkmap_map_wait);
}

static inline void kunmap(struct page *page)
{
}

because they already did it just above.

> > Which means after we fix vhost to add the flush_dcache_page after
> > kunmap, Parisc will get a double hit (but it also means Parisc was
> > the only one of those archs that needed explicit cache flushes, where
> > vhost worked correctly so far.. so it kind of proves your point that
> > giving up is the safe choice).
>
> What double hit?  If there's no cache to flush then cache flush is a
> no-op.  It's also a highly pipelineable no-op because the CPU has the L1
> cache within easy reach.  The only event when flush takes a large
> amount of time is if we actually have dirty data to write back to main
> memory.

The double hit is in parisc copy_to_user_page:

#define copy_to_user_page(vma, page, vaddr, dst, src, len)	\
do {								\
	flush_cache_page(vma, vaddr, page_to_pfn(page));	\
	memcpy(dst, src, len);					\
	flush_kernel_dcache_range_asm((unsigned long)dst, (unsigned long)dst + len); \
} while (0)

That is executed just before kunmap:

static inline void kunmap(struct page *page)
{
	flush_kernel_dcache_page_addr(page_address(page));
}

Can't argue about the fact your "safer" kunmap is safer, but we cannot
rely on common code unless we remove some optimization from the common
code abstractions and we make all archs do kunmap like parisc.

Thanks,
Andrea
Andrea Arcangeli
2019-Mar-12 21:53 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On Tue, Mar 12, 2019 at 02:19:15PM -0700, James Bottomley wrote:
> I mean in the sequence
>
> flush_dcache_page(page);
> flush_dcache_page(page);
>
> The first flush_dcache_page did all the work and the second is a
> tightly pipelined no-op.  That's what I mean by there not really being
> a double hit.

Ok, I wasn't sure it was clear there was a double (profiling) hit on
that function.

void flush_kernel_dcache_page_addr(void *addr)
{
	unsigned long flags;

	flush_kernel_dcache_page_asm(addr);
	purge_tlb_start(flags);
	pdtlb_kernel(addr);
	purge_tlb_end(flags);
}

#define purge_tlb_start(flags)	spin_lock_irqsave(&pa_tlb_lock, flags)
#define purge_tlb_end(flags)	spin_unlock_irqrestore(&pa_tlb_lock, flags)

You got a system-wide spinlock in there that won't just go away the
second time. So it's a bit more than a tightly pipelined "noop".

Your logic of adding the flush on kunmap makes sense, all I'm saying is
that it's sacrificing some performance for safety. You asked "optimize
what", I meant to optimize away all the above quoted code that will end
up running twice for each vhost set_bit when it should run just once
like in other archs. And it clearly paid off until now (until now it
ran just once and it was the only safe one).

Before we can leverage your idea to flush the dcache on kunmap in
common code without having to sacrifice performance in arch code, we'd
need to change all other archs to add the cache flushes on kunmap too,
and then remove the cache flushes from the other places like copy_page
or we'd waste CPU. Then you'd have the best of both worlds, no double
flush and kunmap would be enough.

Thanks,
Andrea
Andrea Arcangeli
2019-Mar-12 22:50 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On Tue, Mar 12, 2019 at 03:02:54PM -0700, James Bottomley wrote:
> I'm sure there must be workarounds elsewhere in the other arch code
> otherwise things like this, which appear all over drivers/, wouldn't
> work:
>
> drivers/scsi/isci/request.c:1430
>
> 	kaddr = kmap_atomic(page);
> 	memcpy(kaddr + sg->offset, src_addr, copy_len);
> 	kunmap_atomic(kaddr);
>

Are you sure "page" is a userland page with an alias address?

	sg->page_link = (unsigned long)virt_to_page(addr);

page_link seems to point to kernel memory.

I found an apparent solution like parisc on arm 32bit:

void __kunmap_atomic(void *kvaddr)
{
	unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
	int idx, type;

	if (kvaddr >= (void *)FIXADDR_START) {
		type = kmap_atomic_idx();
		idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id();

		if (cache_is_vivt())
			__cpuc_flush_dcache_area((void *)vaddr, PAGE_SIZE);

However on arm 64bit kunmap_atomic is not implemented at all, and other
32bit implementations don't do it; for example sparc seems to do the
cache flush too only if the kernel is built with CONFIG_DEBUG_HIGHMEM
(which makes the flushing conditional on the debug option).

The kunmap_atomic where fixmap is used is flushing the tlb lazily, so
even on 32bit you can't even be sure there was a tlb flush for each
single page you unmapped, so it's hard to see how the above can work
safely if "page" had been a userland page mapped with an aliased CPU
cache.

> the sequence dirties the kernel virtual address but doesn't flush
> before doing kunmap.  There are hundreds of other examples which is why
> I think adding flush_kernel_dcache_page() is an already lost cause.

In lots of cases kmap is needed just to modify kernel memory, not to
modify userland memory (where get/put_user is more commonly used
instead..), and there's no cache aliasing in such a case.

> Actually copy_user_page() is unused in the main kernel.  The big
> problem is copy_user_highpage() but that's mostly highly optimised by
> the VIPT architectures (in other words you can fiddle with kmap without
> impacting it).

copy_user_page is not unused, it's called precisely by
copy_user_highpage, which is why the cache flushes are done inside
copy_user_page.

static inline void copy_user_highpage(struct page *to, struct page *from,
	unsigned long vaddr, struct vm_area_struct *vma)
{
	char *vfrom, *vto;

	vfrom = kmap_atomic(from);
	vto = kmap_atomic(to);
	copy_user_page(vto, vfrom, vaddr, to);
	kunmap_atomic(vto);
	kunmap_atomic(vfrom);
}
Christoph Hellwig
2019-Mar-13 16:05 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> I've got to say: optimize what?  What code do we ever have in the
> kernel that kmap's a page and then doesn't do anything with it? You can
> guarantee that on kunmap the page is either referenced (needs
> invalidating) or updated (needs flushing). The in-kernel use of kmap is
> always
>
> kmap
> do something with the mapped page
> kunmap
>
> In a very short interval.  It seems just a simplification to make
> kunmap do the flush if needed rather than try to have the users
> remember.  The thing which makes this really simple is that on most
> architectures flush and invalidate is the same operation.  If you
> really want to optimize you can use the referenced and dirty bits on
> the kmapped pte to tell you what operation to do, but if your flush is
> your invalidate, you simply assume the data needs flushing on kunmap
> without checking anything.

I agree that this would be a good way to simplify the API. Now we'd
just need volunteers to implement this for all architectures that need
cache flushing and then remove the explicit flushing in the callers..

> > Which means after we fix vhost to add the flush_dcache_page after
> > kunmap, Parisc will get a double hit (but it also means Parisc was
> > the only one of those archs that needed explicit cache flushes, where
> > vhost worked correctly so far.. so it kind of proves your point that
> > giving up is the safe choice).
>
> What double hit?  If there's no cache to flush then cache flush is a
> no-op.  It's also a highly pipelineable no-op because the CPU has the L1
> cache within easy reach.  The only event when flush takes a large
> amount of time is if we actually have dirty data to write back to main
> memory.

I've heard people complaining that on some microarchitectures even
no-op cache flushes are relatively expensive. Don't ask me why, but if
we can easily avoid double flushes we should do that.
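For readers who want to see the shape of the simplification being agreed on
here, a rough sketch - modelled on the parisc kunmap() quoted earlier in the
thread, and assuming each architecture supplies a flush hook along the lines
of parisc's flush_kernel_dcache_page_addr() - could look like this. It is an
illustration of the idea only, not a proposed patch:

static inline void kunmap(struct page *page)
{
	/*
	 * Sketch only: have kunmap() itself write back the kernel alias,
	 * so callers no longer need to remember flush_dcache_page() /
	 * flush_kernel_dcache_page() after modifying the mapping.
	 * flush_kernel_dcache_page_addr() stands in for whatever
	 * per-arch flush primitive the architecture provides.
	 */
	flush_kernel_dcache_page_addr(page_address(page));
	if (!PageHighMem(page))
		return;
	kunmap_high(page);
}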
Michael S. Tsirkin
2019-Mar-14 10:42 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On Wed, Mar 13, 2019 at 09:37:08AM -0700, James Bottomley wrote:
> On Wed, 2019-03-13 at 09:05 -0700, Christoph Hellwig wrote:
> > On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> > > I've got to say: optimize what?  What code do we ever have in the
> > > kernel that kmap's a page and then doesn't do anything with it? You
> > > can guarantee that on kunmap the page is either referenced (needs
> > > invalidating) or updated (needs flushing). The in-kernel use of
> > > kmap is always
> > >
> > > kmap
> > > do something with the mapped page
> > > kunmap
> > >
> > > In a very short interval.  It seems just a simplification to make
> > > kunmap do the flush if needed rather than try to have the users
> > > remember.  The thing which makes this really simple is that on most
> > > architectures flush and invalidate is the same operation.  If you
> > > really want to optimize you can use the referenced and dirty bits
> > > on the kmapped pte to tell you what operation to do, but if your
> > > flush is your invalidate, you simply assume the data needs flushing
> > > on kunmap without checking anything.
> >
> > I agree that this would be a good way to simplify the API.  Now
> > we'd just need volunteers to implement this for all architectures
> > that need cache flushing and then remove the explicit flushing in
> > the callers..
>
> Well, it's already done on parisc ... I can help with this if we agree
> it's the best way forward.  It's really only architectures that
> implement flush_dcache_page that would need modifying.
>
> It may also improve performance because some kmap/use/flush/kunmap
> sequences have flush_dcache_page() instead of
> flush_kernel_dcache_page() and the former is hugely expensive and
> usually unnecessary because GUP already flushed all the user aliases.
>
> In the interests of full disclosure the reason we do it for parisc is
> because our later machines have problems even with clean aliases.  So
> on most VIPT systems, doing kmap/read/kunmap creates a fairly harmless
> clean alias.  Technically it should be invalidated, because if you
> remap the same page to the same colour you get cached stale data, but
> in practice the data is expired from the cache long before that
> happens, so the problem is almost never seen if the flush is forgotten.
> Our problem is on the P9xxx processor: they have an L1/L2 VIPT, L3 PIPT
> cache.  As the L1/L2 caches expire clean data, they place the expiring
> contents into L3, but because L3 is PIPT, the stale alias suddenly
> becomes the default for any read of the physical page because any
> update which dirtied the cache line often gets written to main memory
> and placed into the L3 as clean *before* the clean alias in L1/L2 gets
> expired, so the older clean alias replaces it.
>
> Our only recourse is to kill all aliases with prejudice before the
> kernel loses ownership.
>
> > > > Which means after we fix vhost to add the flush_dcache_page after
> > > > kunmap, Parisc will get a double hit (but it also means Parisc
> > > > was the only one of those archs that needed explicit cache
> > > > flushes, where vhost worked correctly so far.. so it kind of
> > > > proves your point that giving up is the safe choice).
> > >
> > > What double hit?  If there's no cache to flush then cache flush is
> > > a no-op.  It's also a highly pipelineable no-op because the CPU has
> > > the L1 cache within easy reach.  The only event when flush takes a
> > > large amount of time is if we actually have dirty data to write
> > > back to main memory.
> >
> > I've heard people complaining that on some microarchitectures even
> > no-op cache flushes are relatively expensive.  Don't ask me why,
> > but if we can easily avoid double flushes we should do that.
>
> It's still not entirely free for us.  Our internal cache line is around
> 32 bytes (some have 16 and some have 64) but that means we need 128
> flushes for a page ... we definitely can't pipeline them all.  So I
> agree duplicate flush elimination would be a small improvement.
>
> James

I suspect we'll keep the copyXuser path around for 32 bit anyway -
right Jason?
So we can also keep using that on parisc...

--
MST
Jason Wang
2019-Mar-14 13:49 UTC
[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()
On 2019/3/14 6:42, Michael S. Tsirkin wrote:
>>>>> Which means after we fix vhost to add the flush_dcache_page after
>>>>> kunmap, Parisc will get a double hit (but it also means Parisc
>>>>> was the only one of those archs that needed explicit cache flushes,
>>>>> where vhost worked correctly so far.. so it kind of proves your
>>>>> point that giving up is the safe choice).
>>>> What double hit? If there's no cache to flush then cache flush is
>>>> a no-op. It's also a highly pipelineable no-op because the CPU has
>>>> the L1 cache within easy reach. The only event when flush takes a
>>>> large amount of time is if we actually have dirty data to write back
>>>> to main memory.
>>> I've heard people complaining that on some microarchitectures even
>>> no-op cache flushes are relatively expensive. Don't ask me why,
>>> but if we can easily avoid double flushes we should do that.
>> It's still not entirely free for us. Our internal cache line is around
>> 32 bytes (some have 16 and some have 64) but that means we need 128
>> flushes for a page ... we definitely can't pipeline them all. So I
>> agree duplicate flush elimination would be a small improvement.
>>
>> James
> I suspect we'll keep the copyXuser path around for 32 bit anyway -
> right Jason?

Yes, since we don't want to slow down 32bit.

Thanks

> So we can also keep using that on parisc...
>
> --