Mike Christie
2023-Apr-10 17:16 UTC
[PATCH v6 11/11] vhost: allow userspace to create workers
On 4/10/23 2:04 AM, Jason Wang wrote:
> On Wed, Apr 5, 2023 at 7:08 AM Mike Christie
> <michael.christie at oracle.com> wrote:
>>
>> On 4/4/23 3:00 AM, Jason Wang wrote:
>>>>
>>>> -static void vhost_worker_free(struct vhost_dev *dev)
>>>> +static void vhost_worker_put(struct vhost_dev *dev, struct vhost_worker *worker)
>>>> {
>>>> -        struct vhost_worker *worker = dev->worker;
>>>> -
>>>>         if (!worker)
>>>>                 return;
>>>>
>>>> -        dev->worker = NULL;
>>>> +        if (!refcount_dec_and_test(&worker->refcount))
>>>> +                return;
>>>> +
>>>>         WARN_ON(!llist_empty(&worker->work_list));
>>>>         vhost_task_stop(worker->vtsk);
>>>>         kfree(worker);
>>>> }
>>>>
>>>> +static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
>>>> +{
>>>> +        if (vq->worker)
>>>
>>> What happens to the pending work that queues for the old worker?
>>
>> I didn't think there would be works queued at this time. I see your comment
>> below about my assumption about the backend being set being wrong. Will
>> discuss down there.
>>
>>
>>>>
>>>> +/* Caller must have device and virtqueue mutex */
>>>> +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>>>> +                                     struct vhost_worker *worker)
>>>> +{
>>>> +        refcount_inc(&worker->refcount);
>>>> +        vhost_vq_detach_worker(vq);
>>>> +        vq->worker = worker;
>>>
>>> What happens if there's a pending flush in a specific device (e.g. in
>>> vhost_scsi_tmf_resp_work())?
>>
>> We wouldn't hit that specific case where we are running the above code and
>> vhost_scsi_tmf_resp_work.
>>
>> Either:
>>
>> 1. The backend is NULL and has never been set, and so we would never have
>> called vhost_scsi_tmf_resp_work, because it can only be called after the
>> backend is set.
>>
>> 2. If the backend has been set and vhost_scsi_tmf_resp_work has
>> run or is running, then we would not have called __vhost_vq_attach_worker
>> from vhost_vq_attach_worker because it would see vhost_vq_get_backend
>> returning a non-NULL value.
>>
>> If vhost_scsi later sets the backend to NULL, then vhost_scsi_clear_endpoint
>> will have made sure the flush has completed when the clear function returns.
>> It does that with the device mutex, so when we run __vhost_vq_attach_worker
>> it will only see a vq/worker with no flushes in progress.
>
> Ok.
>
>>
>> For the general case of whether we can be doing a vhost_dev_flush/vhost_work_flush_on
>> and __vhost_vq_attach_worker, I thought we are ok as well, because I
>> thought we currently have to hold the device mutex when we flush, so those can't
>> race with ioctl calls to vhost_vq_attach_worker since we hold the dev mutex
>> during those ioctls.
>
> I'm not sure I understand here, but we can't assume the user of
> vhost_work_queue() is called in the ioctl context. See

The comment above was only to answer the question about __vhost_vq_attach_worker
racing with works queued from a flush.

In general, I thought we did something to stop IO with the vq mutex
(clear the backend, stop polling, etc.) and then do vhost_dev_flush under the
dev mutex. So once the flush is done there would not be any new vhost_work_queue
calls for the virtqueue.

For the vhost_zerocopy_callback case, when can handle_tx_zerocopy start to be
called? It looked like it's only called after the backend is set, because
handle_tx checks for a non-NULL backend.

> vhost_zerocopy_callback(). But since you want to limit the call before
> set_backend, another question comes, do we really need the dynamic
> attaching/creating in this case?
> How about a simple one ioctl that is
> used to build the queue to workers mapping instead?

I didn't think we need the dynamic case. It was from a review comment.

See at the very bottom for more info, because it's related to your
free worker question.

>
>> Or we call flush from the file_operations release function
>> so the device is closed and can't race with ioctls.
>>
>>>
>>> Does this mean we need to hold vq mutex when doing the flush? (which
>>> seems not the case of vhost_scsi_tmf_resp_work()).
>>>
>>>> +}
>>>> +
>>>> +/* Caller must have device and virtqueue mutex */
>>>> +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>>>> +                                  struct vhost_vring_worker *info)
>>>> +{
>>>> +        unsigned long index = info->worker_id;
>>>> +        struct vhost_dev *dev = vq->dev;
>>>> +        struct vhost_worker *worker;
>>>> +
>>>> +        if (!dev->use_worker)
>>>> +                return -EINVAL;
>>>> +
>>>> +        /*
>>>> +         * We don't support setting a worker on an active vq to make flushing
>>>> +         * and removal simple.
>>>> +         */
>>>> +        if (vhost_vq_get_backend(vq))
>>>> +                return -EBUSY;
>>>
>>> This assumes the worker won't be started until the backend is set,
>>> which is not true.
>>
>> I can see how we get flushes before setting the backend, but I thought we are
>> ok because we have the device mutex held.
>>
>> What are the other possible cases? Is one case that we could get a
>> VHOST_SET_VRING_KICK before setting the backend, and so vhost_poll_start starts
>> to allow vhost_poll_queue before the backend is set?
>
> Yes, and it's not necessarily the kick, the networking core could wake
> up workers before set_backend.

I'm not fully understanding this part. I now agree we can wake up workers
before set_backend. But are you saying we can wake up workers for vhost net
before the kick also? If so, I think you have to give me a hint, because I
didn't see it.

I did find vhost vsock had a case like you described though. It looked like
vhost_transport_send_pkt could be called anytime after VHOST_VSOCK_SET_GUEST_CID.

Is there a case like that for vhost net?

>
>> Is there anything else?
>
> Haven't found time to do this, but what I want to say is, if we want
> this assumption, we need to document it and change the devices that

Yeah, I only had:

+ * virtqueues. This must be done before the virtqueue is active.

In the next version, I'll be more clear.

> are affected by this change.

For this part about changing devices, what about the opposite? We could
just have drivers opt into this where we handle the new ioctl in the
driver's ioctl handling. If there is no driver specific code, vhost_dev_ioctl
would just return not supported like it does today.

The thing is that for net, you guys already have your way to do mq
and get multiple workers, so you probably won't need this, so I don't
want to add complexity to net. For vsock, I'm not sure if doing multiple
workers is useful. If it is, then we will need some vsock specific
code to handle running works during the worker creation operation.

For vhost scsi, we have a way to do mq right now, so we just need the
multiple workers support. It seems simpler for scsi because we just
have the vhost_scsi_do_plug path which can queue works before set_backend
(however that's fixed now in mst's current tree).

>>
>>>
>>> struct vhost_worker_state {
>>>         /*
>>>          * For VHOST_NEW_WORKER the kernel will return the new vhost_worker id.
>>>          * For VHOST_FREE_WORKER this must be set to the id of the vhost_worker
>>>          * to free.
>>>          */
>>>         int worker_id;
>>> };
>>>
>>> A side effect of using xa_index directly is that userspace can guess
>>> the xa_index of the default worker and free it here, is this intended?
>>
>> I don't need the functionality. It was only something that I didn't
>> think mattered much, because you can only free the worker if it's not in
>> use by any virtqueues, so I didn't add any special code to handle it.
>> The user would have had to do an ATTACH to the default worker and replace
>> it, then it could free it, so it works like the other workers.
>
> It's about the robustness of the uAPI:
>
> struct vhost_worker_state {
>         /*
>          * For VHOST_NEW_WORKER the kernel will return the new vhost_worker id.
>          * For VHOST_FREE_WORKER this must be set to the id of the vhost_worker
>          * to free.
>          */
>         int worker_id;
> };
>
> It looks like the worker_id is the one userspace gets from
> VHOST_NEW_WORKER. If yes, we should hide the default worker.

I see. Ok.

>
>
>>
>>> Should we hide the default worker from xa?
>>
>> I can change it if you are worried about future problems.
>>
>>
>>>> -        default:
>>>> +        case VHOST_ATTACH_VRING_WORKER:
>>>> +                if (copy_from_user(&w, argp, sizeof(w))) {
>>>> +                        r = -EFAULT;
>>>> +                        break;
>>>> +                }
>>>> +                r = vhost_vq_attach_worker(vq, &w);
>>>> +                if (!r && copy_to_user(argp, &w, sizeof(w)))
>>>> +                        r = -EFAULT;
>>>> +                break;
>>>
>>> It's a little odd that we have new and attach but only a free.
>>
>> Do you mean we would also want a detach? I've been debating that.
>> I'm not sure what the user wanted though.
>>
>> Would it leave the virtqueue with no worker? Or, does it pick the first
>> worker it finds? If there is no worker because we did the former or
>> because userspace detached all of them, then do we just drop work that
>> gets queued?
>>
>> It seemed like a user would never want to drop work, so I made it so
>> you can only tell it to use new workers (attach, which I guess is more
>> like a swap worker).
>
> swap and free old worker indeed?

I didn't understand the question. We would only free if there were no
more virtqueues using the worker and if userspace does a free on it
which releases the final refcount.

>
> static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
> {
>         if (vq->worker)
>                 vhost_worker_put(vq->dev, vq->worker);
>         vq->worker = NULL;
> }
>
> That makes me think under which case we should use VHOST_FREE_WORKER?
>

A lot of this is from a review comment a while back :)

A long while back there was the review comment that we wanted to be able to
share workers between devices one day and swap them around to different
devices and virtqueues. So, I think the idea was to use them more like a pool.

To handle that comment, I switched from the one giant ioctl approach like you
suggested above, and made it so userspace can create N workers and then, if it
wanted, move them around to different devices/vqs. The free would be for when
you were lowering resource use and wanted to free some of the workers in the
pool.

This was supported for some patchset versions I posted, but then I realized
it wasn't useful for vhost-scsi because we only want to create/delete workers
when we create/delete virtqueues, it added a bunch of extra complexity
in the main IO path and flushing code, and there were the questions of do you
share between parents with different mms, etc. So I dropped it in the last 2
postings, thinking we will probably only use this for vhost-scsi and net/vsock
will never use it, so keep it as simple as possible.
However, I left the ability to create/free workers in the interface, so if we
wanted to take more of a pool approach we could build on what we have.
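To make the attach/swap semantics above concrete, here is a rough sketch of how
the refcounting is meant to work. It is pieced together from the hunks quoted
earlier in the thread, with the locking and ioctl plumbing trimmed, so treat it
as a simplified sketch rather than the exact patch code:

/*
 * Simplified sketch of the refcounted worker swap, pieced together from
 * the hunks quoted above. Locking and the ioctl plumbing are trimmed.
 */
static void vhost_worker_put(struct vhost_dev *dev, struct vhost_worker *worker)
{
        if (!worker)
                return;

        /*
         * The worker is only torn down once the last reference is gone,
         * i.e. no virtqueue is attached to it anymore and userspace has
         * released it (VHOST_FREE_WORKER or device cleanup).
         */
        if (!refcount_dec_and_test(&worker->refcount))
                return;

        WARN_ON(!llist_empty(&worker->work_list));
        vhost_task_stop(worker->vtsk);
        kfree(worker);
}

static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
{
        if (vq->worker)
                vhost_worker_put(vq->dev, vq->worker);
        vq->worker = NULL;
}

/* Caller must have device and virtqueue mutex */
static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
                                     struct vhost_worker *worker)
{
        /*
         * Take a reference on the new worker before dropping the vq's
         * reference on the old one, so an attach is really a swap. The
         * old worker is only freed by vhost_worker_put() once nothing
         * else holds a reference to it.
         */
        refcount_inc(&worker->refcount);
        vhost_vq_detach_worker(vq);
        vq->worker = worker;
}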
Jason Wang
2023-Apr-11 03:00 UTC
[PATCH v6 11/11] vhost: allow userspace to create workers
On Tue, Apr 11, 2023 at 1:16 AM Mike Christie <michael.christie at oracle.com> wrote:
>
> On 4/10/23 2:04 AM, Jason Wang wrote:
> > On Wed, Apr 5, 2023 at 7:08 AM Mike Christie
> > <michael.christie at oracle.com> wrote:
> >>
> >> On 4/4/23 3:00 AM, Jason Wang wrote:
> >>>>
> >>>> -static void vhost_worker_free(struct vhost_dev *dev)
> >>>> +static void vhost_worker_put(struct vhost_dev *dev, struct vhost_worker *worker)
> >>>> {
> >>>> -        struct vhost_worker *worker = dev->worker;
> >>>> -
> >>>>         if (!worker)
> >>>>                 return;
> >>>>
> >>>> -        dev->worker = NULL;
> >>>> +        if (!refcount_dec_and_test(&worker->refcount))
> >>>> +                return;
> >>>> +
> >>>>         WARN_ON(!llist_empty(&worker->work_list));
> >>>>         vhost_task_stop(worker->vtsk);
> >>>>         kfree(worker);
> >>>> }
> >>>>
> >>>> +static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
> >>>> +{
> >>>> +        if (vq->worker)
> >>>
> >>> What happens to the pending work that queues for the old worker?
> >>
> >> I didn't think there would be works queued at this time. I see your comment
> >> below about my assumption about the backend being set being wrong. Will
> >> discuss down there.
> >>
> >>
> >>>>
> >>>> +/* Caller must have device and virtqueue mutex */
> >>>> +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
> >>>> +                                     struct vhost_worker *worker)
> >>>> +{
> >>>> +        refcount_inc(&worker->refcount);
> >>>> +        vhost_vq_detach_worker(vq);
> >>>> +        vq->worker = worker;
> >>>
> >>> What happens if there's a pending flush in a specific device (e.g. in
> >>> vhost_scsi_tmf_resp_work())?
> >>
> >> We wouldn't hit that specific case where we are running the above code and
> >> vhost_scsi_tmf_resp_work.
> >>
> >> Either:
> >>
> >> 1. The backend is NULL and has never been set, and so we would never have
> >> called vhost_scsi_tmf_resp_work, because it can only be called after the
> >> backend is set.
> >>
> >> 2. If the backend has been set and vhost_scsi_tmf_resp_work has
> >> run or is running, then we would not have called __vhost_vq_attach_worker
> >> from vhost_vq_attach_worker because it would see vhost_vq_get_backend
> >> returning a non-NULL value.
> >>
> >> If vhost_scsi later sets the backend to NULL, then vhost_scsi_clear_endpoint
> >> will have made sure the flush has completed when the clear function returns.
> >> It does that with the device mutex, so when we run __vhost_vq_attach_worker
> >> it will only see a vq/worker with no flushes in progress.
> >
> > Ok.
> >
> >>
> >> For the general case of whether we can be doing a vhost_dev_flush/vhost_work_flush_on
> >> and __vhost_vq_attach_worker, I thought we are ok as well, because I
> >> thought we currently have to hold the device mutex when we flush, so those can't
> >> race with ioctl calls to vhost_vq_attach_worker since we hold the dev mutex
> >> during those ioctls.
> >
> > I'm not sure I understand here, but we can't assume the user of
> > vhost_work_queue() is called in the ioctl context. See
>
> The comment above was only to answer the question about __vhost_vq_attach_worker
> racing with works queued from a flush.
>
> In general, I thought we did something to stop IO with the vq mutex
> (clear the backend, stop polling, etc.) and then do vhost_dev_flush under the
> dev mutex. So once the flush is done there would not be any new vhost_work_queue
> calls for the virtqueue.
>
> For the vhost_zerocopy_callback case, when can handle_tx_zerocopy start to be called?
> It looked like it's only called after the backend is set, because handle_tx
> checks for a non-NULL backend.

You are right. Again, what I want to say is, if the patch adds new
assumptions, we should document it.

>
> >
> > vhost_zerocopy_callback(). But since you want to limit the call before
> > set_backend, another question comes, do we really need the dynamic
> > attaching/creating in this case? How about a simple one ioctl that is
> > used to build the queue to workers mapping instead?
>
>
> I didn't think we need the dynamic case. It was from a review comment.

Right, so we actually don't need three new ioctls, but only a single one is
sufficient?

struct vhost_worker_state {
        __u16 workers;
        __u16 queue_to_work_map[];
};

And limiting this to be called before the datapath can run is sufficient?
(sorry I missed some of the previous comments)

>
> See at the very bottom for more info, because it's related to your
> free worker question.
>
> >
> >
> >> Or we call flush from the file_operations release function
> >> so the device is closed and can't race with ioctls.
> >>
> >>>
> >>> Does this mean we need to hold vq mutex when doing the flush? (which
> >>> seems not the case of vhost_scsi_tmf_resp_work()).
> >>>
> >>>> +}
> >>>> +
> >>>> +/* Caller must have device and virtqueue mutex */
> >>>> +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
> >>>> +                                  struct vhost_vring_worker *info)
> >>>> +{
> >>>> +        unsigned long index = info->worker_id;
> >>>> +        struct vhost_dev *dev = vq->dev;
> >>>> +        struct vhost_worker *worker;
> >>>> +
> >>>> +        if (!dev->use_worker)
> >>>> +                return -EINVAL;
> >>>> +
> >>>> +        /*
> >>>> +         * We don't support setting a worker on an active vq to make flushing
> >>>> +         * and removal simple.
> >>>> +         */
> >>>> +        if (vhost_vq_get_backend(vq))
> >>>> +                return -EBUSY;
> >>>
> >>> This assumes the worker won't be started until the backend is set,
> >>> which is not true.
> >>
> >> I can see how we get flushes before setting the backend, but I thought we are
> >> ok because we have the device mutex held.
> >>
> >> What are the other possible cases? Is one case that we could get a
> >> VHOST_SET_VRING_KICK before setting the backend, and so vhost_poll_start starts
> >> to allow vhost_poll_queue before the backend is set?
> >
> > Yes, and it's not necessarily the kick, the networking core could wake
> > up workers before set_backend.
>
> I'm not fully understanding this part. I now agree we can wake up workers
> before set_backend. But are you saying we can wake up workers for vhost net
> before the kick also? If so, I think you have to give me a hint, because I
> didn't see it.

Ok, I meant the sock can wake up the vhost worker, but I think I was wrong,
since that only works after the backend is set.

>
> I did find vhost vsock had a case like you described though. It looked like
> vhost_transport_send_pkt could be called anytime after VHOST_VSOCK_SET_GUEST_CID.

Yes.

> Is there a case like that for vhost net?

I think not.

>
> >
> >> Is there anything else?
> >
> > Haven't found time to do this, but what I want to say is, if we want
> > this assumption, we need to document it and change the devices that
>
> Yeah, I only had:
>
> + * virtqueues. This must be done before the virtqueue is active.
>
> In the next version, I'll be more clear.
>
> > are affected by this change.
>
> For this part about changing devices, what about the opposite? We could
> just have drivers opt into this where we handle the new ioctl in the
> driver's ioctl handling.
> If there is no driver specific code, vhost_dev_ioctl
> would just return not supported like it does today.

Yes, that could be one way.

>
> The thing is that for net, you guys already have your way to do mq
> and get multiple workers, so you probably won't need this, so I don't
> want to add complexity to net. For vsock, I'm not sure if doing multiple
> workers is useful. If it is, then we will need some vsock specific
> code to handle running works during the worker creation operation.

Yes, it really depends on the multiqueue models.

Net uses 1 queue pair per device. SCSI uses multiqueue per device.
Not sure about vsock, Stefano may have more comments on this.

>
> For vhost scsi, we have a way to do mq right now, so we just need the
> multiple workers support. It seems simpler for scsi because we just
> have the vhost_scsi_do_plug path which can queue works before set_backend
> (however that's fixed now in mst's current tree).
>
>
> >>
> >>>
> >>> struct vhost_worker_state {
> >>>         /*
> >>>          * For VHOST_NEW_WORKER the kernel will return the new vhost_worker id.
> >>>          * For VHOST_FREE_WORKER this must be set to the id of the vhost_worker
> >>>          * to free.
> >>>          */
> >>>         int worker_id;
> >>> };
> >>>
> >>> A side effect of using xa_index directly is that userspace can guess
> >>> the xa_index of the default worker and free it here, is this intended?
> >>
> >> I don't need the functionality. It was only something that I didn't
> >> think mattered much, because you can only free the worker if it's not in
> >> use by any virtqueues, so I didn't add any special code to handle it.
> >> The user would have had to do an ATTACH to the default worker and replace
> >> it, then it could free it, so it works like the other workers.
> >
> > It's about the robustness of the uAPI:
> >
> > struct vhost_worker_state {
> >         /*
> >          * For VHOST_NEW_WORKER the kernel will return the new vhost_worker id.
> >          * For VHOST_FREE_WORKER this must be set to the id of the vhost_worker
> >          * to free.
> >          */
> >         int worker_id;
> > };
> >
> > It looks like the worker_id is the one userspace gets from
> > VHOST_NEW_WORKER. If yes, we should hide the default worker.
>
> I see. Ok.
>
> >
> >
> >>
> >>> Should we hide the default worker from xa?
> >>
> >> I can change it if you are worried about future problems.
> >>
> >>
> >>>> -        default:
> >>>> +        case VHOST_ATTACH_VRING_WORKER:
> >>>> +                if (copy_from_user(&w, argp, sizeof(w))) {
> >>>> +                        r = -EFAULT;
> >>>> +                        break;
> >>>> +                }
> >>>> +                r = vhost_vq_attach_worker(vq, &w);
> >>>> +                if (!r && copy_to_user(argp, &w, sizeof(w)))
> >>>> +                        r = -EFAULT;
> >>>> +                break;
> >>>
> >>> It's a little odd that we have new and attach but only a free.
> >>
> >> Do you mean we would also want a detach? I've been debating that.
> >> I'm not sure what the user wanted though.
> >>
> >> Would it leave the virtqueue with no worker? Or, does it pick the first
> >> worker it finds? If there is no worker because we did the former or
> >> because userspace detached all of them, then do we just drop work that
> >> gets queued?
> >>
> >> It seemed like a user would never want to drop work, so I made it so
> >> you can only tell it to use new workers (attach, which I guess is more
> >> like a swap worker).
> >
> > swap and free old worker indeed?
>
> I didn't understand the question.

I mean, if I understand the code correctly, the code tries to drop the refcnt
and free the old worker during attach.

> We would only free if there were no
> more virtqueues using the worker and if userspace does a free on it
> which releases the final refcount.
>
> >
> > static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
> > {
> >         if (vq->worker)
> >                 vhost_worker_put(vq->dev, vq->worker);
> >         vq->worker = NULL;
> > }
> >
> > That makes me think under which case we should use VHOST_FREE_WORKER?
> >
>
> A lot of this is from a review comment a while back :)

Sorry for that, I'm unable to recover those comments.

>
> A long while back there was the review comment that we wanted to be able to
> share workers between devices one day and swap them around to different
> devices and virtqueues. So, I think the idea was to use them more like a pool.

I see, but it seems much more complicated than what is proposed here, and we
would need to address things like cgroup and rlimit, which seems non-trivial.

>
> To handle that comment, I switched from the one giant ioctl approach like you
> suggested above, and made it so userspace can create N workers and then, if it
> wanted, move them around to different devices/vqs. The free would be for when
> you were lowering resource use and wanted to free some of the workers in the
> pool.

Ok.

>
> This was supported for some patchset versions I posted, but then I realized
> it wasn't useful for vhost-scsi because we only want to create/delete workers
> when we create/delete virtqueues,

My understanding is that attach/detach/new are only required when:

1) the device implements multiqueue per device (not the case for net, but it
is for SCSI)
2) the device has the ability to enable and disable queues (not the case for
SCSI, but it is for net)

> it added a bunch of extra complexity
> in the main IO path and flushing code, and there were the questions of do you
> share between parents with different mms, etc. So I dropped it in the last 2
> postings, thinking we will probably only use this for vhost-scsi and net/vsock
> will never use it, so keep it as simple as possible.
>
> However, I left the ability to create/free workers in the interface, so if we
> wanted to take more of a pool approach we could build on what we have.

Ok, I'm fine with either approach. But if attach implies freeing the old
worker, we don't need an explicit free?

Thanks
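To be concrete about the single ioctl idea mentioned above, the rough shape I
have in mind is something like the sketch below. The struct is the one I
suggested earlier in this mail; the handler name is made up,
vhost_worker_create() is assumed to be the worker-spawning helper from this
series, and the caller is assumed to copy the struct from userspace and hold
the device mutex before the backend is set. It is only meant to show the
shape, not a worked out proposal:

/*
 * Sketch only: build the whole queue -> worker mapping in one call,
 * before the backend is set, so no works can be in flight. Validation
 * of map->workers and teardown of partially created workers on error
 * are omitted.
 */
struct vhost_worker_state {
        __u16 workers;                  /* number of workers to create */
        __u16 queue_to_work_map[];      /* one entry per virtqueue */
};

static long vhost_dev_set_worker_map(struct vhost_dev *dev,
                                     const struct vhost_worker_state *map)
{
        struct vhost_worker **workers;
        int i;

        /* Each virtqueue must point at a worker we are about to create. */
        for (i = 0; i < dev->nvqs; i++)
                if (map->queue_to_work_map[i] >= map->workers)
                        return -EINVAL;

        workers = kcalloc(map->workers, sizeof(*workers), GFP_KERNEL);
        if (!workers)
                return -ENOMEM;

        for (i = 0; i < map->workers; i++) {
                workers[i] = vhost_worker_create(dev);
                if (!workers[i]) {
                        kfree(workers);
                        return -ENOMEM; /* unwinding created workers omitted */
                }
        }

        /* Point each virtqueue at the worker userspace picked for it. */
        for (i = 0; i < dev->nvqs; i++)
                dev->vqs[i]->worker = workers[map->queue_to_work_map[i]];

        kfree(workers);
        return 0;
}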