On Fri, 2015-09-18 at 14:09 -0700, Nicholas A. Bellinger wrote:
> On Fri, 2015-09-18 at 11:12 -0700, Ming Lin wrote:
> > On Thu, 2015-09-17 at 17:55 -0700, Nicholas A. Bellinger wrote:
> > > On Thu, 2015-09-17 at 16:31 -0700, Ming Lin wrote:
> > > > On Wed, 2015-09-16 at 23:10 -0700, Nicholas A. Bellinger wrote:
> > > > > Hi Ming & Co,
> >
> > <SNIP>
> >
> > > > > > I think the future "LIO NVMe target" only speaks NVMe protocol.
> > > > > >
> > > > > > Nick(CCed), could you correct me if I'm wrong?
> > > > > >
> > > > > > For SCSI stack, we have:
> > > > > > virtio-scsi(guest)
> > > > > > tcm_vhost(or vhost_scsi, host)
> > > > > > LIO-scsi-target
> > > > > >
> > > > > > For NVMe stack, we'll have similar components:
> > > > > > virtio-nvme(guest)
> > > > > > vhost_nvme(host)
> > > > > > LIO-NVMe-target
> > > > > >
> > > > >
> > > > > I think it's more interesting to consider a 'vhost style' driver that
> > > > > can be used with unmodified nvme host OS drivers.
> > > > >
> > > > > Dr. Hannes (CC'ed) had done something like this for megasas a few years
> > > > > back using specialized QEMU emulation + eventfd based LIO fabric driver,
> > > > > and got it working with Linux + MSFT guests.
> > > > >
> > > > > Doing something similar for nvme would (potentially) be on par with
> > > > > current virtio-scsi+vhost-scsi small-block performance for scsi-mq
> > > > > guests, without the extra burden of a new command set specific virtio
> > > > > driver.
> > > >
> > > > Trying to understand it.
> > > > Is it like below?
> > > >
> > > > .------------------------.  MMIO   .---------------------------------------.
> > > > | Guest                  |-------->| Qemu                                  |
> > > > | Unmodified NVMe driver |<--------| NVMe device simulation(eventfd based) |
> > > > '------------------------'         '---------------------------------------'
> > > >            |                                           ^
> > > > write NVMe |                                           | notify command
> > > > command    |                                           | completion
> > > > to eventfd |                                           | to eventfd
> > > >            v                                           |
> > > > .--------------------------------------.
> > > > | Host:                                |
> > > > | eventfd based LIO NVMe fabric driver |
> > > > '--------------------------------------'
> > > >                   |
> > > >                   | nvme_queue_rq()
> > > >                   v
> > > > .--------------------------------------.
> > > > | NVMe driver                          |
> > > > '--------------------------------------'
> > > >                   |
> > > >                   |
> > > >                   v
> > > > .-------------------------------------.
> > > > | NVMe device                         |
> > > > '-------------------------------------'
> > > >
> > >
> > > Correct.  The LIO driver on KVM host would be handling some amount of
> > > NVMe host interface emulation in kernel code, and would be able to
> > > decode nvme Read/Write/Flush operations and translate -> submit to
> > > existing backend drivers.
> >
> > Let me call the "eventfd based LIO NVMe fabric driver" as
> > "tcm_eventfd_nvme".
> >
> > Currently, LIO frontend driver(iscsi, fc, vhost-scsi etc) talk to LIO
> > backend driver(fileio, iblock etc) with SCSI commands.
> >
> > Did you mean the "tcm_eventfd_nvme" driver need to translate NVMe
> > commands to SCSI commands and then submit to backend driver?
>
> IBLOCK + FILEIO + RD_MCP don't speak SCSI, they simply process I/Os with
> LBA + length based on SGL memory or pass along a FLUSH with LBA +
> length.
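A quick aside on what that decode amounts to: below is a minimal user-space
sketch of the Read/Write/Flush -> LBA + length translation described above.
The opcodes (0x00 Flush, 0x01 Write, 0x02 Read) and the SLBA/NLB placement in
CDW10-12 follow the NVMe spec; the simplified 64-byte SQE struct and the
example values are purely illustrative, not the kernel's struct nvme_command.

/*
 * Hypothetical sketch of the decode step described above: turn an NVMe I/O
 * submission queue entry into the LBA + length that an LIO backend
 * (IBLOCK/FILEIO/RD_MCP) actually consumes.  The struct below is a
 * simplified stand-in for a 64-byte SQE, not the kernel's struct
 * nvme_command.
 */
#include <stdint.h>
#include <stdio.h>

#define NVME_CMD_FLUSH  0x00
#define NVME_CMD_WRITE  0x01
#define NVME_CMD_READ   0x02

struct nvme_sqe {                       /* simplified submission queue entry */
        uint8_t  opcode;
        uint8_t  flags;
        uint16_t command_id;
        uint32_t nsid;
        uint64_t rsvd;
        uint64_t metadata;
        uint64_t prp1;                  /* guest-physical data pointers */
        uint64_t prp2;
        uint32_t cdw10;                 /* SLBA [31:0]  */
        uint32_t cdw11;                 /* SLBA [63:32] */
        uint32_t cdw12;                 /* NLB in bits [15:0], zero-based */
        /* cdw13..cdw15 omitted in this sketch */
};

/* Decode Read/Write/Flush into LBA + number of blocks for the backend. */
static int decode_io(const struct nvme_sqe *sqe, uint64_t *lba,
                     uint32_t *nr_blocks, int *is_write)
{
        switch (sqe->opcode) {
        case NVME_CMD_READ:
        case NVME_CMD_WRITE:
                *lba = ((uint64_t)sqe->cdw11 << 32) | sqe->cdw10;
                *nr_blocks = (sqe->cdw12 & 0xffff) + 1; /* NLB is 0-based */
                *is_write = (sqe->opcode == NVME_CMD_WRITE);
                return 0;
        case NVME_CMD_FLUSH:
                *lba = 0;
                *nr_blocks = 0;                 /* whole-namespace flush */
                *is_write = 0;
                return 0;
        default:
                return -1;      /* admin/vendor opcodes: not handled here */
        }
}

int main(void)
{
        struct nvme_sqe sqe = {
                .opcode = NVME_CMD_READ,
                .cdw10  = 0x1000,               /* SLBA = 0x1000 */
                .cdw12  = 7,                    /* NLB = 7 -> 8 blocks */
        };
        uint64_t lba;
        uint32_t nr;
        int is_write;

        if (!decode_io(&sqe, &lba, &nr, &is_write))
                printf("%s lba=%llu blocks=%u\n", is_write ? "WRITE" : "READ",
                       (unsigned long long)lba, nr);
        return 0;
}

The backend would then see an I/O of nr_blocks blocks starting at lba, with
the payload described by the PRP/SGL entries in the same frame; NLB being
zero-based is the easy detail to miss.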
>
> So once the 'tcm_eventfd_nvme' driver on KVM host receives a nvme host
> hardware frame via eventfd, it would decode the frame and send along the
> Read/Write/Flush when exposing existing (non nvme native) backend
> drivers.

Learned vhost architecture:
http://blog.vmsplice.net/2011/09/qemu-internals-vhost-architecture.html

The nice thing is it is not tied to KVM in any way.

For SCSI, there are "virtio-scsi" in guest kernel and "vhost-scsi" in
host kernel.

For NVMe, there is no "virtio-nvme" in guest kernel(just unmodified NVMe
driver), but I'll do similar thing in Qemu with vhost infrastructure.
And there is "vhost_nvme" in host kernel.

For the "virtqueue" implementation in qemu-nvme, I'll possibly just
use/copy drivers/virtio/virtio_ring.c, same as what
linux/tools/virtio/virtio_test.c does.

A bit more detail graph as below. What do you think?

.-----------------------------------------.          .------------------------.
| Guest(Linux, Windows, FreeBSD, Solaris) |  NVMe    | qemu                   |
| unmodified NVMe driver                  |  command | NVMe device emulation  |
|                                         | -------> | vhost + virtqueue      |
'-----------------------------------------'          '------------------------'
                    |                                     |           ^
        passthrough |                                     | kick/notify
       NVMe command |                                     | via eventfd
userspace       via virtqueue                             |           |
                    v                                     v           |
----------------------------------------------------------------------------------
        .----------------------------------------------.
kernel  | LIO frontend driver                          |
        |   - vhost_nvme                               |
        '----------------------------------------------'
               | translate           ^
               | (NVMe command)      |
               | to                  |
               v (LBA, length)       |
        .----------------------------------------------.
        | LIO backend driver                           |
        |   - fileio (/mnt/xxx.file)                   |
        |   - iblock (/dev/sda1, /dev/nvme0n1, ...)    |
        '----------------------------------------------'
               |                     ^
               | submit_bio()        |
               v                     |
        .----------------------------------------------.
        | block layer                                  |
        |                                              |
        '----------------------------------------------'
               |                     ^
               |                     |
               v                     |
        .----------------------------------------------.
        | block device driver                          |
        |                                              |
        '----------------------------------------------'
            |              |               |                |
            |              |               |                |
            v              v               v                v
        .------------. .-----------. .------------. .---------------.
        | SATA       | | SCSI      | | NVMe       | | ....          |
        '------------' '-----------' '------------' '---------------'
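To make the kick/notify arrows above concrete, here is a rough user-space
sketch of how the QEMU side could hand a doorbell-kick eventfd and a
completion-notify eventfd to the host driver. The eventfd()/read()/write()
semantics are real; the /dev/vhost-nvme node and the VHOST_NVME_SET_EVENTFDS
ioctl are made-up placeholders for whatever interface vhost_nvme ends up
exposing. In practice the kick fd would be registered as a KVM ioeventfd on
the doorbell MMIO range rather than written by hand, and the notify fd would
be wired to interrupt injection.

/*
 * Rough QEMU-side sketch of the kick/notify plumbing.  eventfd(), read(),
 * write() and ioctl() are real; the /dev/vhost-nvme node and the
 * VHOST_NVME_SET_EVENTFDS ioctl below are made-up placeholders for whatever
 * interface a vhost_nvme driver would actually expose.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct nvme_eventfds {                  /* placeholder ioctl payload */
        int sq_kick_fd;                 /* guest doorbell write -> wake kernel */
        int cq_notify_fd;               /* kernel completion -> QEMU/KVM interrupt */
};

#define VHOST_NVME_SET_EVENTFDS _IOW(0xAF, 0x60, struct nvme_eventfds)  /* made up */

int main(void)
{
        int vhost_fd = open("/dev/vhost-nvme", O_RDWR); /* hypothetical node */
        struct nvme_eventfds efds = {
                .sq_kick_fd   = eventfd(0, EFD_CLOEXEC),
                .cq_notify_fd = eventfd(0, EFD_CLOEXEC),
        };

        if (vhost_fd < 0 || efds.sq_kick_fd < 0 || efds.cq_notify_fd < 0) {
                perror("setup");
                return 1;
        }
        if (ioctl(vhost_fd, VHOST_NVME_SET_EVENTFDS, &efds) < 0) {
                perror("VHOST_NVME_SET_EVENTFDS");
                return 1;
        }

        /* Guest wrote a submission queue doorbell: kick the kernel side. */
        uint64_t one = 1;
        if (write(efds.sq_kick_fd, &one, sizeof(one)) != sizeof(one))
                perror("kick");

        /* Block until the kernel side signals one or more completions. */
        uint64_t ncompleted;
        if (read(efds.cq_notify_fd, &ncompleted, sizeof(ncompleted)) == sizeof(ncompleted))
                printf("%llu completion(s) ready\n", (unsigned long long)ncompleted);

        close(vhost_fd);
        return 0;
}

The kernel side would block on (or poll) the kick fd via its struct
eventfd_ctx, consume the newly queued submission entries, and write to the
notify fd once completions are ready.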
On Wed, 2015-09-23 at 15:58 -0700, Ming Lin wrote:
> On Fri, 2015-09-18 at 14:09 -0700, Nicholas A. Bellinger wrote:
> > On Fri, 2015-09-18 at 11:12 -0700, Ming Lin wrote:
> > > On Thu, 2015-09-17 at 17:55 -0700, Nicholas A. Bellinger wrote:

<SNIP>

> > IBLOCK + FILEIO + RD_MCP don't speak SCSI, they simply process I/Os with
> > LBA + length based on SGL memory or pass along a FLUSH with LBA +
> > length.
> >
> > So once the 'tcm_eventfd_nvme' driver on KVM host receives a nvme host
> > hardware frame via eventfd, it would decode the frame and send along the
> > Read/Write/Flush when exposing existing (non nvme native) backend
> > drivers.
>
> Learned vhost architecture:
> http://blog.vmsplice.net/2011/09/qemu-internals-vhost-architecture.html
>
> The nice thing is it is not tied to KVM in any way.
>

Yes.  There are assumptions vhost currently makes about the guest using
virtio queues however, and at least for an initial vhost_nvme prototype
it's probably easier to avoid hacking up drivers/vhost/* (for now)..

(Adding MST CC')

> For SCSI, there are "virtio-scsi" in guest kernel and "vhost-scsi" in
> host kernel.
>
> For NVMe, there is no "virtio-nvme" in guest kernel(just unmodified NVMe
> driver), but I'll do similar thing in Qemu with vhost infrastructure.
> And there is "vhost_nvme" in host kernel.
>
> For the "virtqueue" implementation in qemu-nvme, I'll possibly just
> use/copy drivers/virtio/virtio_ring.c, same as what
> linux/tools/virtio/virtio_test.c does.
>
> A bit more detail graph as below. What do you think?
>
> .-----------------------------------------.          .------------------------.
> | Guest(Linux, Windows, FreeBSD, Solaris) |  NVMe    | qemu                   |
> | unmodified NVMe driver                  |  command | NVMe device emulation  |
> |                                         | -------> | vhost + virtqueue      |
> '-----------------------------------------'          '------------------------'
>                     |                                     |           ^
>         passthrough |                                     | kick/notify
>        NVMe command |                                     | via eventfd
> userspace       via virtqueue                             |           |
>                     v                                     v           |
> ----------------------------------------------------------------------------------

This should read something like:

Passthrough of nvme hardware frames via QEMU PCI-e struct vhost_mem into
a custom vhost_nvme kernel driver ioctl using struct file + struct
eventfd_ctx primitives.

Eg: QEMU user-space is not performing the nvme command decode before
passing the emulated nvme hardware frame up to the host kernel driver.

>         .----------------------------------------------.
> kernel  | LIO frontend driver                          |
>         |   - vhost_nvme                               |
>         '----------------------------------------------'
>                | translate           ^
>                | (NVMe command)      |
>                | to                  |
>                v (LBA, length)       |

vhost_nvme is performing host kernel level decode of user-space provided
nvme hardware frames into nvme command + LBA + length + SGL buffer for
target backend driver submission.

>         .----------------------------------------------.
>         | LIO backend driver                           |
>         |   - fileio (/mnt/xxx.file)                   |
>         |   - iblock (/dev/sda1, /dev/nvme0n1, ...)    |
>         '----------------------------------------------'
>                |                     ^
>                | submit_bio()        |
>                v                     |
>         .----------------------------------------------.
>         | block layer                                  |
>         |                                              |
>         '----------------------------------------------'

For this part, HCH mentioned he is currently working on some code to
pass native NVMe commands + SGL memory via blk-mq struct request into
struct nvme_dev and/or struct nvme_queue.

>                |                     ^
>                |                     |
>                v                     |
>         .----------------------------------------------.
>         | block device driver                          |
>         |                                              |
>         '----------------------------------------------'
>             |              |               |                |
>             |              |               |                |
>             v              v               v                v
>         .------------. .-----------. .------------. .---------------.
>         | SATA       | | SCSI      | | NVMe       | | ....          |
>         '------------' '-----------' '------------' '---------------'
>
>

Looks fine.

Btw, after chatting with Dr. Hannes this week at SDC here are his
original rts-megasas -v6 patches from Feb 2013.

Note they are standalone patches that require a sufficiently old enough
LIO + QEMU to actually build + function.

https://github.com/Datera/rts-megasas/blob/master/rts_megasas-qemu-v6.patch
https://github.com/Datera/rts-megasas/blob/master/rts_megasas-fabric-v6.patch

For groking purposes, they demonstrate the principle design for a host
kernel level driver, along with the megasas firmware interface (MFI)
specific emulation magic that makes up the bulk of the code.

Take a look.

--nab
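To flesh out the "host kernel level decode of user-space provided nvme
hardware frames" step, here is a hedged user-space model (not kernel code) of
the controller-side queue handling a vhost_nvme driver would need: consume
64-byte SQEs up to the doorbell-written tail, hand each one to the
decode/backend path, and post 16-byte CQEs carrying the current phase tag so
the guest driver can spot new completions. QUEUE_DEPTH, the simplified structs
and the synchronous backend_submit() placeholder are all assumptions; a real
driver would complete entries asynchronously when the backend I/O finishes.

/*
 * User-space model (not kernel code) of the controller-side queue handling
 * sketched above.  A vhost_nvme driver would do roughly this against queue
 * memory shared with QEMU/the guest: consume SQEs up to the doorbell tail,
 * decode + submit each one, and post CQEs with the current phase tag.
 * QUEUE_DEPTH, the simplified structs and backend_submit() are assumptions.
 */
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 64

struct nvme_sqe {                       /* simplified 64-byte submission entry */
        uint8_t  opcode;
        uint8_t  flags;
        uint16_t command_id;
        uint8_t  rest[60];
};

struct nvme_cqe {                       /* 16-byte completion entry */
        uint32_t result;
        uint32_t rsvd;
        uint16_t sq_head;               /* how far the SQ has been consumed */
        uint16_t sq_id;
        uint16_t command_id;
        uint16_t status;                /* bit 0 of this halfword is the phase tag */
};

struct emu_queue_pair {
        struct nvme_sqe *sq;            /* shared with the QEMU/guest side */
        struct nvme_cqe *cq;
        uint16_t sq_head, cq_tail;
        uint8_t  phase;                 /* starts at 1, flips on every CQ wrap */
};

/*
 * Placeholder for the decode + LIO backend submission discussed in this
 * thread; a real driver would complete asynchronously when the I/O is done.
 */
static uint16_t backend_submit(const struct nvme_sqe *sqe)
{
        (void)sqe;
        return 0;                       /* NVMe status: success */
}

/*
 * Called after the kick eventfd fires; sq_tail is the doorbell value the
 * guest wrote.  Returns the number of completions posted, so the caller
 * knows to signal the notify eventfd / inject an interrupt.
 */
static int process_sq(struct emu_queue_pair *q, uint16_t sq_tail)
{
        int done = 0;

        while (q->sq_head != sq_tail) {
                const struct nvme_sqe *sqe = &q->sq[q->sq_head];
                struct nvme_cqe cqe = {
                        .command_id = sqe->command_id,
                        .status     = (uint16_t)((backend_submit(sqe) << 1) | q->phase),
                };

                /* Tell the guest how far the controller has consumed the SQ. */
                q->sq_head = (uint16_t)((q->sq_head + 1) % QUEUE_DEPTH);
                cqe.sq_head = q->sq_head;

                q->cq[q->cq_tail] = cqe;
                q->cq_tail = (uint16_t)((q->cq_tail + 1) % QUEUE_DEPTH);
                if (q->cq_tail == 0)
                        q->phase ^= 1;  /* phase flip lets the guest detect wrap */
                done++;
        }
        return done;
}

int main(void)
{
        static struct nvme_sqe sq[QUEUE_DEPTH];
        static struct nvme_cqe cq[QUEUE_DEPTH];
        struct emu_queue_pair q = { .sq = sq, .cq = cq, .phase = 1 };
        int posted;

        sq[0].opcode = 0x02;            /* pretend the guest queued one Read */
        sq[0].command_id = 42;

        posted = process_sq(&q, 1);
        printf("posted %d completion(s), cq[0].status=0x%x\n",
               posted, (unsigned)cq[0].status);
        return 0;
}

In the real datapath the CQE would be written back into guest-visible memory
and followed by an eventfd signal, which QEMU turns into an interrupt toward
the unmodified guest NVMe driver.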
On Sat, Sep 26, 2015 at 10:01 PM, Nicholas A. Bellinger
<nab at linux-iscsi.org> wrote:
>
> Btw, after chatting with Dr. Hannes this week at SDC here are his
> original rts-megasas -v6 patches from Feb 2013.
>
> Note they are standalone patches that require a sufficiently old enough
> LIO + QEMU to actually build + function.
>
> https://github.com/Datera/rts-megasas/blob/master/rts_megasas-qemu-v6.patch
> https://github.com/Datera/rts-megasas/blob/master/rts_megasas-fabric-v6.patch
>
> For groking purposes, they demonstrate the principle design for a host
> kernel level driver, along with the megasas firmware interface (MFI)
> specific emulation magic that makes up the bulk of the code.
>
> Take a look.

Big thanks. Reading the patches now.

>
> --nab
>
On 09/27/2015 07:01 AM, Nicholas A. Bellinger wrote:
> On Wed, 2015-09-23 at 15:58 -0700, Ming Lin wrote:
>> On Fri, 2015-09-18 at 14:09 -0700, Nicholas A. Bellinger wrote:
>>> On Fri, 2015-09-18 at 11:12 -0700, Ming Lin wrote:
>>>> On Thu, 2015-09-17 at 17:55 -0700, Nicholas A. Bellinger wrote:
>
> <SNIP>
>

<Even more SNIP>

>
> Btw, after chatting with Dr. Hannes this week at SDC here are his
> original rts-megasas -v6 patches from Feb 2013.
>
> Note they are standalone patches that require a sufficiently old enough
> LIO + QEMU to actually build + function.
>
> https://github.com/Datera/rts-megasas/blob/master/rts_megasas-qemu-v6.patch
> https://github.com/Datera/rts-megasas/blob/master/rts_megasas-fabric-v6.patch
>
> For groking purposes, they demonstrate the principle design for a host
> kernel level driver, along with the megasas firmware interface (MFI)
> specific emulation magic that makes up the bulk of the code.
>
And indeed, Nic persuaded me to have them updated to qemu latest.
Which I'll be doing shortly. Stay tuned.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   zSeries & Storage
hare at suse.de                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)