On 08/02/2023 11:43, Stefan Hajnoczi wrote:
> On Wed, Feb 08, 2023 at 09:33:33AM +0100, Peter-Jan Gootzen wrote:
>> On 07/02/2023 22:57, Vivek Goyal wrote:
>>> On Tue, Feb 07, 2023 at 04:32:02PM -0500, Stefan Hajnoczi wrote:
>>>> On Tue, Feb 07, 2023 at 02:53:58PM -0500, Vivek Goyal wrote:
>>>>> On Tue, Feb 07, 2023 at 02:45:39PM -0500, Stefan Hajnoczi wrote:
>>>>>> On Tue, Feb 07, 2023 at 11:14:46AM +0100, Peter-Jan Gootzen wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>
>>>>> [cc German]
>>>>>
>>>>>>> For my MSc thesis project in collaboration with IBM
>>>>>>> (https://github.com/IBM/dpu-virtio-fs) we are looking to improve the
>>>>>>> performance of the virtio-fs driver in high-throughput scenarios. We
>>>>>>> think the main bottleneck is the fact that the virtio-fs driver does
>>>>>>> not support multi-queue (while the spec does). A big factor in this
>>>>>>> is that our setup on the virtio-fs device side (a DPU) does not
>>>>>>> easily allow multiple cores to tend to a single virtio queue.
>>>>>
>>>>> This is an interesting limitation in the DPU.
>>>>
>>>> Virtqueues are single-consumer queues anyway. Sharing them between
>>>> multiple threads would be expensive. I think using multiqueue is
>>>> natural and not specific to DPUs.
>>>
>>> Can we create multiple threads (a thread pool) on the DPU and let these
>>> threads process requests in parallel (while there is only one virt
>>> queue)?
>>>
>>> This is what we had done in virtiofsd. One thread is dedicated to
>>> pulling requests from the virt queue and then passing each request to a
>>> thread pool to process it. That seems to help with performance in
>>> certain cases.
>>>
>>> Is that possible on the DPU? That by itself can give a nice performance
>>> boost for certain workloads without having to implement multiqueue at
>>> all.
>>>
>>> Just curious. I am not opposed to the idea of multiqueue; I am just
>>> curious about the kind of performance gain (if any) it can provide. And
>>> will this be helpful for rust virtiofsd running on the host as well?
>>>
>>> Thanks
>>> Vivek
>>>
>> There is technically nothing preventing us from consuming a single queue
>> on multiple cores; however, our current Virtio implementation (DPU-side)
>> is set up with the assumption that you should never want to do that
>> (concurrency mayhem around the Virtqueues and the DMAs). So instead of
>> putting all the work into reworking the implementation to support that
>> and still incurring the big overhead, we see it as more fitting to amend
>> the virtio-fs driver with multi-queue support.
>>
>>> Is it just a theory at this point in time, or have you implemented it
>>> and seen a significant performance benefit with multiqueue?
>>
>> It is a theory, but we are currently seeing that, with the single
>> request queue, the single core attending to that queue on the DPU is
>> reasonably close to being fully saturated.
>>
>>> And will this be helpful for rust virtiofsd running on the host as
>>> well?
>>
>> I figure this would depend on the workload and the user's needs. Having
>> many cores concurrently pulling on their own virtq and then immediately
>> processing the request locally would of course improve performance. But
>> we are offloading all this work to the DPU, for providing
>> high-throughput cloud services.
>
> I think Vivek is getting at whether your code processes requests
> sequentially or in parallel. A single thread processing the virtqueue
> that hands off requests to worker threads or uses io_uring to perform
> I/O asynchronously will perform differently from a single thread that
> processes requests sequentially in a blocking fashion. Multiqueue is not
> necessary for parallelism, but the single queue might become a
> bottleneck.

Requests are handled in a non-blocking fashion with remote IO on the DPU.
Our current architecture is as follows:
T1: Tends to the Virtq, parses FUSE into remote IO and fires off the
    asynchronous remote IO.
T2: Polls for completion of the remote IO, parses it back into FUSE and
    puts the FUSE buffers in a completion queue of T1.
T1: Handles the Virtio completion and DMA of the requests in the CQ.

Thread 1 busy-polls its two queues (Virtq and CQ) with equal priority, and
thread 2 busy-polls as well. This setup is not really optimal, but we are
working within the constraints of both our DPU and our remote IO stack.
Currently we are able to get the following sequential single-job 4k
throughput:
Write: 246MiB/s
Read: 20MiB/s

We are not sure yet where the bottleneck is for reads; we hope to be able
to match it to the write speed. For writes, the two main bottlenecks we see
are the single Virtq (so limited parallelism on the DPU and remote side)
and the fact that virtio-fs IO is constrained to the 4k page size (NFS, for
example, which we are trying to replace, sees huge performance gains with
larger block sizes).

>> This is what I remembered as well, but can't find it clearly in the
>> source right now. Do you have references to the source for this?
>
> virtio_blk.ko uses an irq_affinity descriptor to tell virtio_find_vqs()
> to spread MSI interrupts across CPUs:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/virtio_blk.c#n609
>
> The core blk-mq code has the blk_mq_virtio_map_queues() function to map
> block layer queues to virtqueues:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/blk-mq-virtio.c#n24
>
> virtio_net.ko manually sets virtqueue affinity:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/virtio_net.c#n2283
>
> virtio_net.ko tells the core net subsystem about queues using
> netif_set_real_num_tx_queues() and then skbs are mapped to queues by
> common code:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/dev.c#n4079

Thanks for the pointers. :)

Thanks,
Peter-Jan
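As background for the kernel pointers above: the sketch below shows, in the
style of virtio_blk's init_vq(), how a multi-queue virtio-fs driver could ask
the transport for one request virtqueue per CPU and let it spread the MSI-X
vectors via an irq_affinity descriptor. This is a hypothetical illustration,
not the in-tree virtio_fs.c; the names virtio_fs_setup_vqs_sketch and struct
virtio_fs_sketch are made up for this example.

/*
 * Hypothetical sketch only -- not the in-tree virtio_fs.c.  It shows how a
 * multi-queue virtio-fs driver could request one request virtqueue per CPU
 * and let the transport spread MSI-X vectors across CPUs via an
 * irq_affinity descriptor, following the virtio_blk pattern linked above.
 */
#include <linux/interrupt.h>
#include <linux/cpumask.h>
#include <linux/slab.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

struct virtio_fs_sketch {
	struct virtqueue **vqs;
	unsigned int nvqs;		/* 1 hiprio + N request queues */
};

static void virtio_fs_vq_done(struct virtqueue *vq)
{
	/* Request/hiprio completion handling would be hooked up here. */
}

static int virtio_fs_setup_vqs_sketch(struct virtio_device *vdev,
				      struct virtio_fs_sketch *fs)
{
	/* Keep the hiprio queue out of the automatic affinity spreading. */
	struct irq_affinity desc = { .pre_vectors = 1 };
	vq_callback_t **callbacks;
	const char **names;
	unsigned int i;
	int err = -ENOMEM;

	/* One hiprio queue plus one request queue per online CPU. */
	fs->nvqs = 1 + num_online_cpus();

	fs->vqs = kcalloc(fs->nvqs, sizeof(*fs->vqs), GFP_KERNEL);
	callbacks = kcalloc(fs->nvqs, sizeof(*callbacks), GFP_KERNEL);
	names = kcalloc(fs->nvqs, sizeof(*names), GFP_KERNEL);
	if (!fs->vqs || !callbacks || !names)
		goto out;

	callbacks[0] = virtio_fs_vq_done;
	names[0] = "hiprio";
	for (i = 1; i < fs->nvqs; i++) {
		callbacks[i] = virtio_fs_vq_done;
		names[i] = "requests";	/* real code would number these */
	}

	/*
	 * The irq_affinity descriptor asks the transport to distribute the
	 * per-virtqueue MSI-X vectors across CPUs, as virtio_blk does.
	 */
	err = virtio_find_vqs(vdev, fs->nvqs, fs->vqs, callbacks, names,
			      &desc);
out:
	if (err) {
		kfree(fs->vqs);
		fs->vqs = NULL;
	}
	kfree(callbacks);
	kfree(names);
	return err;
}

Note that virtio-fs is not a blk-mq driver, so the blk_mq_virtio_map_queues()
helper mentioned above would not apply directly; a multiqueue virtio-fs would
presumably need its own per-CPU (or per-task) policy for picking a request
queue at submission time, or could set affinity manually with
virtqueue_set_affinity() the way virtio_net does.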
On Wed, Feb 08, 2023 at 05:29:25PM +0100, Peter-Jan Gootzen wrote:
[...]
> Requests are handled in a non-blocking fashion with remote IO on the DPU.
> Our current architecture is as follows:
> T1: Tends to the Virtq, parses FUSE into remote IO and fires off the
>     asynchronous remote IO.
> T2: Polls for completion of the remote IO, parses it back into FUSE and
>     puts the FUSE buffers in a completion queue of T1.
> T1: Handles the Virtio completion and DMA of the requests in the CQ.
>
> Thread 1 busy-polls its two queues (Virtq and CQ) with equal priority, and
> thread 2 busy-polls as well. This setup is not really optimal, but we are
> working within the constraints of both our DPU and our remote IO stack.
> Currently we are able to get the following sequential single-job 4k
> throughput:
> Write: 246MiB/s
> Read: 20MiB/s

I had been doing some performance benchmarking for virtiofs and I found
some old results.

https://github.com/rhvgoyal/virtiofs-tests/tree/master/performance-results/feb-10-2021

While running on top of a local fs, with bs=4K, with a single queue I could
achieve more than 600MB/s.

NAME            WORKLOAD                Bandwidth       IOPS
default         seqread-psync           625.0mb         156.2k
no-tpool        seqread-psync           660.8mb         165.2k

The catch here, I think, is that the host is doing the caching. In your
case I am assuming there is no caching at the DPU and all the I/O is going
to remote storage (which might be doing caching in memory).

Anyway, the point I am trying to make is that even with a single vq,
virtiofs can push a reasonable amount of I/O. I will be curious to find out
how much multiqueue can improve these numbers further.

> We are not sure yet where the bottleneck is for reads; we hope to be able
> to match it to the write speed. For writes, the two main bottlenecks we
> see are the single Virtq (so limited parallelism on the DPU and remote
> side) and the fact that virtio-fs IO is constrained to the 4k page size
> (NFS, for example, which we are trying to replace, sees huge performance
> gains with larger block sizes).

I am wondering how you concluded that the single vq is the bottleneck for
performance, and not the remote storage the DPU is sending I/O to.

Thanks
Vivek
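The workload names in the table above (seqread-psync) follow fio
conventions. The invocation below is only a rough approximation of such a
single-job 4k sequential psync read, for readers who want to reproduce the
general shape of the test; the mount path, size and runtime are placeholders
and this is not the exact job file from the linked repository.

fio --name=seqread-psync --directory=/mnt/virtiofs --rw=read \
    --ioengine=psync --bs=4k --size=4g --numjobs=1 \
    --runtime=30 --time_based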
On Wed, Feb 08, 2023 at 05:29:25PM +0100, Peter-Jan Gootzen wrote:
[...]
> Requests are handled in a non-blocking fashion with remote IO on the DPU.
> Our current architecture is as follows:
> T1: Tends to the Virtq, parses FUSE into remote IO and fires off the
>     asynchronous remote IO.
> T2: Polls for completion of the remote IO, parses it back into FUSE and
>     puts the FUSE buffers in a completion queue of T1.
> T1: Handles the Virtio completion and DMA of the requests in the CQ.
>
> Thread 1 busy-polls its two queues (Virtq and CQ) with equal priority, and
> thread 2 busy-polls as well. This setup is not really optimal, but we are
> working within the constraints of both our DPU and our remote IO stack.

Why does T1 need to handle VIRTIO completion and DMA requests instead of
T2?

Stefan
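To make the datapath that Stefan's question refers to easier to follow, here
is a rough userspace C sketch of the two-thread layout described above: T1
busy-polls the Virtq and a completion queue fed by T2, while T2 busy-polls
the remote-IO backend. Everything here is hypothetical -- the virtq_* and
remote_io_* primitives and the SPSC ring stand in for the DPU SDK and remote
IO stack -- and it only illustrates the hand-off, not the actual
IBM/dpu-virtio-fs implementation.

/*
 * Illustrative sketch only (hypothetical names): the two-thread DPU
 * datapath described above.  A single-producer/single-consumer ring
 * models the T2 -> T1 completion queue (CQ).
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CQ_DEPTH 256			/* must be a power of two */

struct fuse_req;			/* parsed FUSE request (placeholder) */

struct spsc_ring {			/* T2 -> T1 completion queue */
	struct fuse_req *slots[CQ_DEPTH];
	_Atomic size_t head;		/* advanced by producer (T2) */
	_Atomic size_t tail;		/* advanced by consumer (T1) */
};

static bool cq_push(struct spsc_ring *cq, struct fuse_req *req)
{
	size_t head = atomic_load_explicit(&cq->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&cq->tail, memory_order_acquire);

	if (head - tail == CQ_DEPTH)
		return false;		/* CQ full */
	cq->slots[head % CQ_DEPTH] = req;
	atomic_store_explicit(&cq->head, head + 1, memory_order_release);
	return true;
}

static struct fuse_req *cq_pop(struct spsc_ring *cq)
{
	size_t tail = atomic_load_explicit(&cq->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&cq->head, memory_order_acquire);
	struct fuse_req *req;

	if (tail == head)
		return NULL;		/* CQ empty */
	req = cq->slots[tail % CQ_DEPTH];
	atomic_store_explicit(&cq->tail, tail + 1, memory_order_release);
	return req;
}

/* Placeholders for the DPU virtio and remote-IO primitives (hypothetical). */
struct fuse_req *virtq_poll_request(void);		/* DMA in + parse FUSE   */
void remote_io_submit(struct fuse_req *req);		/* fire async remote IO  */
struct fuse_req *remote_io_poll_completion(void);	/* completed remote IO   */
void virtq_complete(struct fuse_req *req);		/* DMA out + used update */

static void t1_loop(struct spsc_ring *cq)		/* thread 1 */
{
	for (;;) {
		/* Poll the Virtq and the CQ with equal priority. */
		struct fuse_req *req = virtq_poll_request();
		if (req)
			remote_io_submit(req);
		req = cq_pop(cq);
		if (req)
			virtq_complete(req);
	}
}

static void t2_loop(struct spsc_ring *cq)		/* thread 2 */
{
	for (;;) {
		struct fuse_req *req = remote_io_poll_completion();
		if (req)
			while (!cq_push(cq, req))
				;			/* CQ full: spin */
	}
}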