On 2021/2/4 ??7:19, Si-Wei Liu wrote:> On Tue, Feb 2, 2021 at 9:16 PM Jason Wang <jasowang at redhat.com>
wrote:
>>
>> On 2021/2/3 ??1:54, Si-Wei Liu wrote:
>>> On Tue, Feb 2, 2021 at 1:23 AM Eli Cohen <elic at nvidia.com>
wrote:
>>>> On Tue, Feb 02, 2021 at 12:38:51AM -0800, Si-Wei Liu wrote:
>>>>> Thanks Eli and Jason for clarifications. See inline.
>>>>>
>>>>> On Mon, Feb 1, 2021 at 11:06 PM Eli Cohen <elic at
nvidia.com> wrote:
>>>>>> On Tue, Feb 02, 2021 at 02:02:25PM +0800, Jason Wang
wrote:
>>>>>>> On 2021/2/2 ??12:15, Si-Wei Liu wrote:
>>>>>>>> On Mon, Feb 1, 2021 at 7:13 PM Jason Wang
<jasowang at redhat.com> wrote:
>>>>>>>>> On 2021/2/2 ??3:17, Si-Wei Liu wrote:
>>>>>>>>>> On Mon, Feb 1, 2021 at 10:51 AM Si-Wei
Liu <siwliu.kernel at gmail.com> wrote:
>>>>>>>>>>> On Thu, Jan 28, 2021 at 5:46 AM Eli
Cohen <elic at nvidia.com> wrote:
>>>>>>>>>>>> suspend_vq should only suspend
the VQ on not save the current available
>>>>>>>>>>>> index. This is done when a
change of map occurs when the driver calls
>>>>>>>>>>>> save_channel_info().
>>>>>>>>>>> Hmmm, suspend_vq() is also called
by teardown_vq(), the latter of
>>>>>>>>>>> which doesn't save the
available index as save_channel_info() doesn't
>>>>>>>>>>> get called in that path at all. How
does it handle the case that
>>>>>>>>>>> aget_vq_state() is called from
userspace (e.g. QEMU) while the
>>>>>>>>>>> hardware VQ object was torn down,
but userspace still wants to access
>>>>>>>>>>> the queue index?
>>>>>>>>>>>
>>>>>>>>>>> Refer to
https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei.liu at
oracle.com/
>>>>>>>>>>>
>>>>>>>>>>> vhost VQ 0 ring restore failed: -1:
Resource temporarily unavailable (11)
>>>>>>>>>>> vhost VQ 1 ring restore failed: -1:
Resource temporarily unavailable (11)
>>>>>>>>>>>
>>>>>>>>>>> QEMU will complain with the above
warning while VM is being rebooted
>>>>>>>>>>> or shut down.
>>>>>>>>>>>
>>>>>>>>>>> Looks to me either the kernel
driver should cover this requirement, or
>>>>>>>>>>> the userspace has to bear the
burden in saving the index and not call
>>>>>>>>>>> into kernel if VQ is destroyed.
>>>>>>>>>> Actually, the userspace doesn't
have the insights whether virt queue
>>>>>>>>>> will be destroyed if just changing the
device status via set_status().
>>>>>>>>>> Looking at other vdpa driver in tree
i.e. ifcvf it doesn't behave like
>>>>>>>>>> so. Hence this still looks to me to be
Mellanox specifics and
>>>>>>>>>> mlx5_vdpa implementation detail that
shouldn't expose to userspace.
>>>>>>>>> So I think we can simply drop this patch?
>>>>>>>> Yep, I think so. To be honest I don't know
why it has anything to do
>>>>>>>> with the memory hotplug issue.
>>>>>>> Eli may know more, my understanding is that, during
memory hotplut, qemu
>>>>>>> need to propagate new memory mappings via
set_map(). For mellanox, it means
>>>>>>> it needs to rebuild memory keys, so the virtqueue
needs to be suspended.
>>>>>>>
>>>>>> I think Siwei was asking why the first patch was
related to the hotplug
>>>>>> issue.
>>>>> I was thinking how consistency is assured when
saving/restoring this
>>>>> h/w avail_index against the one in the virtq memory,
particularly in
>>>>> the region_add/.region_del() context (e.g. the hotplug
case). Problem
>>>>> is I don't see explicit memory barrier when guest
thread updates the
>>>>> avail_index, how does the device make sure the h/w
avail_index is
>>>>> uptodate while guest may race with updating the virtq's
avail_index in
>>>>> the mean while? Maybe I completely miss something obvious?
>>>> DKIM-Signature: v1; arsa-sha256; crelaxed/relaxed; dnvidia.com;
sn1;
>>>> t 12257780;
bhHnB0z4VEKwRS3WGY8d836MJgxu5Eln/jbFZlNXVxc08;
>>>>
hX-PGP-Universal:Date:From:To:CC:Subject:Message-ID:References:
>>>> MIME-Version:Content-Type:Content-Disposition:
>>>>
Content-Transfer-Encoding:In-Reply-To:User-Agent:X-Originating-IP:
>>>> X-ClientProxiedBy;
>>>>
bgGmb8+rcn3/rKzKQ/7QzSnghWzZ+FAU0XntsRZYGQ66sFvT7zsYPHogG3LIWNY77t
>>>>
wNHPw7GCJrNaH3nEXPbOp0FMOZw4Kv4W7UPuYPobbLeTkvuPAidjB8dM42vz+1X61t
>>>>
9IVQT9X4hnAxRjI5CqZOo41GS4Tl1X+ykGoA+VE80BR/R/+nQ3tXDVULfppzeB+vu3
>>>>
TWnnpaZ2GyoNyPlMiyVRkHdXzDVgA4uQHxwHn7otGK5J4lzyu8KrFyQtiP+f6hfu5v
>>>>
crJkYS8e9A+rfzUmKWuyHcKcmhPhAVJ4XdpzZcDXXlMHVxG7nR1o88xttC6D1+oNIP
>>>> 9xHI3DkNBpEuA
>>>> If you're asking about syncronization upon hot plug of
memory, the
>>>> hardware always goes to read the available index from memory
when a new
>>>> hardware object is associted with a virtqueue. You can argue
then that
>>>> you don't need to restore the available index and you may
be right but
>>>> this is the currect inteface to the firmware.
>>>>
>>>>
>>>> If you're asking on generally how sync is assured when the
guest updates
>>>> the available index, can you please send a pointer to the code
where you
>>>> see the update without a memory barrier?
>>> This is a snippet of virtqueue_add_split() where avail_index gets
>>> updated by guest:
>>>
>>> /* Put entry in available array (but don't update
avail->idx until they
>>> * do sync). */
>>> avail = vq->split.avail_idx_shadow &
(vq->split.vring.num - 1);
>>> vq->split.vring.avail->ring[avail] =
cpu_to_virtio16(_vq->vdev, head);
>>>
>>> /* Descriptors and available array need to be set before
we expose the
>>> * new available array entries. */
>>> virtio_wmb(vq->weak_barriers);
>>> vq->split.avail_idx_shadow++;
>>> vq->split.vring.avail->idx =
cpu_to_virtio16(_vq->vdev,
>>>
vq->split.avail_idx_shadow);
>>> vq->num_added++;
>>>
>>> There's memory barrier to make sure the update to descriptor
and
>>> available ring is seen before writing to the avail->idx, but
there
>>> seems no gurantee that this update would flush to the memory
>>> immmedately either before or after the mlx5-vdpa is suspened and
gets
>>> the old avail_index value stashed somewhere. In this case, how does
>>> the hardware ensure the consistency between the guest virtio and
host
>>> mlx5-vdpa? Or, it completly relies on guest to update the
avail_index
>>> once the next buffer is available, so that the index will be in
sync
>>> again?
>>
>> I'm not sure I get the question but notice that the driver should
check
>> and notify virtqueue when device want a notification. So there's a
>> virtio_wmb() e.g in:
>>
>> static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
>> {
>> struct vring_virtqueue *vq = to_vvq(_vq);
>> u16 new, old;
>> bool needs_kick;
>>
>> START_USE(vq);
>> /* We need to expose available array entries before checking
avail
>> * event. */
>> virtio_mb(vq->weak_barries);
>>
>> old = vq->split.avail_idx_shadow - vq->num_added;
>> new = vq->split.avail_idx_shadow;
>> vq->num_added = 0;
>>
>> (See the comment above virtio_mb()). So the avail idx is guaranteed to
>> be committed to memroy(cache hierarchy) before the check of
>> notification. I think we sync through this.
> Thanks for pointing it out! Indeed, the avail index is synced before
> kicking the device. However, even so I think the race might still be
> possible, see below:
>
>
> mlx5_vdpa device virtio driver
> -------------------------------------------------------------------------
> virtqueue_add_split
> (bumped up avail_idx1 to
> avail_idx2, but update
> not yet committed to mem)
>
> (hot plug memory...)
> mlx5_vdpa_set_map
> mlx5_vdpa_change_map
> suspend_vqs
> suspend_vq
> (avail_idx1 was saved)
> save_channels_info
> :
> : virtqueue_kick_prepare_split
> : (avail_idx2 committed to memory)
> restore_channels_info
> setup_driver
> ...
> create_virtqueue
> (saved avail_idx1
> getting restored;
> whereas current
> avail_idx2 in
> memory is ignored)
The index is not ignored, the device is expected to read descriptors
from avail_idx1 until avail_idx2.
> :
> :
> virtqueue_notify
> vp_notify
> mlx5_vdpa_kick_vq
> (device processing up to
> avail_idx1 rather than
> avail_idx2?)
>
>
> According to Eli, the vdpa firmware does not load the latest value,
> i.e. avail_idx2 from memory when re-creating the virtqueue object.
During sync it's not needed but after the device start to work, it needs
to read descriptor until the available index.
> Instead it starts with a stale avail_idx1 that was saved in mlx5_vdpa
> driver private structure before the memory update is made. That is the
> way how the current firmware interface is designed. The thing I'm not
> sure though is if the firmware/device will resync and get to the
> latest avail_idx2 from memory while processing the kick request with
> mlx5_vdpa_kick_vq?
I think the point is that, the device must have an internal avail index,
and the device is only expected to sync this internal avail index.
> If not, a few avail entries will be missing in the
> kick, which could cause odd behavior or implicit failure.
>
> But if the firmware will resync on kick_vq (and get_vq_state), I think
> I completely lost the point of saving and restoring avail_idx when
> changing the memory map...
The memory map changing is usually transparent to guest driver but only
the device. Maybe you can refer vhost codes or qemu codes for the device
logic.
Thanks
>
> Thanks,
> -Siwei
>
>> Thanks
>>
>>
>>> Thanks,
>>> -Siwei
>>>
>>>>> Thanks,
>>>>> -Siwei
>>>>>
>>>>>> But you're correct. When memory is added, I get a
new memory map. This
>>>>>> requires me to build a new memory key object which
covers the new memory
>>>>>> map. Since the virtqueue objects are referencing this
memory key, I need
>>>>>> to destroy them and build new ones referncing the new
memory key.
>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> -Siwei
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Eli Cohen
<elic at nvidia.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>
drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 --------
>>>>>>>>>>>> 1 file changed, 8
deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git
a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
>>>>>>>>>>>> index
88dde3455bfd..549ded074ff3 100644
>>>>>>>>>>>> ---
a/drivers/vdpa/mlx5/net/mlx5_vnet.c
>>>>>>>>>>>> +++
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
>>>>>>>>>>>> @@ -1148,8 +1148,6 @@ static
int setup_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq)
>>>>>>>>>>>>
>>>>>>>>>>>> static void
suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq)
>>>>>>>>>>>> {
>>>>>>>>>>>> - struct mlx5_virtq_attr
attr;
>>>>>>>>>>>> -
>>>>>>>>>>>> if
(!mvq->initialized)
>>>>>>>>>>>> return;
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -1158,12 +1156,6 @@ static
void suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m
>>>>>>>>>>>>
>>>>>>>>>>>> if
(modify_virtqueue(ndev, mvq, MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
>>>>>>>>>>>>
mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
>>>>>>>>>>>> -
>>>>>>>>>>>> - if
(query_virtqueue(ndev, mvq, &attr)) {
>>>>>>>>>>>> -
mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
>>>>>>>>>>>> - return;
>>>>>>>>>>>> - }
>>>>>>>>>>>> - mvq->avail_idx =
attr.available_index;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> static void
suspend_vqs(struct mlx5_vdpa_net *ndev)
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.29.2
>>>>>>>>>>>>