I notice this code in drivers/block/xen-blkback/common.h

#define vbd_sz(_v) ((_v)->bdev->bd_part ? \
                    (_v)->bdev->bd_part->nr_sects : \
                    get_capacity((_v)->bdev->bd_disk))

Is the value returned by vbd_sz(_v) the number of sectors in the Linux
device (eg size / 4096), or the number of 512 byte sectors? I suspect
the former, which is causing block requests beyond 1/8th the size of
the device to fail (assuming 4K sectors are expected to work at all -
I can't quite get my head around how it would be expected to work -
does Linux do the read-modify-write if required?)

I can't test until tomorrow AEDT, but maybe someone here knows the
answer already?

James
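For reference, a rough sketch of the arithmetic behind the "1/8th of the
device" symptom described above, assuming vbd_sz() were returning counts
of the device's native 4K sectors while blkback treated them as 512-byte
units; the device size is made up purely for illustration:

/* Hypothetical illustration only: if vbd_sz() handed back native 4K-sector
 * counts but blkback interpreted them as 512-byte sectors, the usable
 * capacity would shrink by 4096/512 = 8. */
#include <stdio.h>

int main(void)
{
    unsigned long long dev_bytes = 100ULL << 30;      /* 100 GiB device (made-up size) */
    unsigned long long sects_4k  = dev_bytes / 4096;  /* what a 4K device would report */
    unsigned long long sects_512 = dev_bytes / 512;   /* what blkback assumes          */

    /* blkback rejects requests past vbd_sz(); if it received 4K counts,
     * only the first 1/8th of the device would be reachable. */
    printf("reachable fraction: %llu / %llu = 1/%llu\n",
           sects_4k, sects_512, sects_512 / sects_4k);
    return 0;
}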
On Mon, Aug 13, 2012 at 02:12:58PM +0000, James Harper wrote:
> I notice this code in drivers/block/xen-blkback/common.h
>
> #define vbd_sz(_v) ((_v)->bdev->bd_part ? \
>                     (_v)->bdev->bd_part->nr_sects : \
>                     get_capacity((_v)->bdev->bd_disk))
>
> Is the value returned by vbd_sz(_v) the number of sectors in the Linux
> device (eg size / 4096), or the number of 512 byte sectors? I suspect
> the former, which is causing block requests beyond 1/8th the size of
> the device to fail (assuming 4K sectors are expected to work at all -
> I can't quite get my head around how it would be expected to work -
> does Linux do the read-modify-write if required?)

I think you need to instrument it to be sure.. But more interesting, do
you actually have a disk that exposes a 4KB hardware and logical sector?
So far I've only found SSDs that expose a 512B logical sector but also
expose the 4KB hardware sector.

Never could figure out how that is all supposed to work, as blkback is
filled with << 9 on a bunch of things.

> I can't test until tomorrow AEDT, but maybe someone here knows the
> answer already?
>
> James
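As context for the "<< 9" pattern mentioned above, a minimal sketch of the
512-byte-sector convention that the shifts encode; the numbers are only
illustrative:

/* sector_t values in the Linux block layer count 512-byte units, so
 * converting between sectors and bytes is a shift by 9 (2^9 = 512). */
#include <stdio.h>

int main(void)
{
    unsigned long long sector = 2048;         /* illustrative sector number */
    unsigned long long bytes  = sector << 9;  /* 2048 * 512 = 1 MiB offset  */

    printf("sector %llu starts at byte %llu\n", sector, bytes);
    printf("byte %llu maps back to sector %llu\n", bytes, bytes >> 9);
    return 0;
}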
> On Mon, Aug 13, 2012 at 02:12:58PM +0000, James Harper wrote:
> > I notice this code in drivers/block/xen-blkback/common.h
> >
> > #define vbd_sz(_v) ((_v)->bdev->bd_part ? \
> >                     (_v)->bdev->bd_part->nr_sects : \
> >                     get_capacity((_v)->bdev->bd_disk))
> >
> > Is the value returned by vbd_sz(_v) the number of sectors in the Linux
> > device (eg size / 4096), or the number of 512 byte sectors? I suspect
> > the former, which is causing block requests beyond 1/8th the size of
> > the device to fail (assuming 4K sectors are expected to work at all -
> > I can't quite get my head around how it would be expected to work -
> > does Linux do the read-modify-write if required?)
>
> I think you need to instrument it to be sure.. But more interesting, do
> you actually have a disk that exposes a 4KB hardware and logical sector?
> So far I've only found SSDs that expose a 512B logical sector but also
> expose the 4KB hardware sector.
>
> Never could figure out how that is all supposed to work, as blkback is
> filled with << 9 on a bunch of things.

I was using bcache, which does expose a 4K block size by default. I
changed it to 512 and it all works now, although I haven't tested
whether there is any loss of performance.

Does Xen provide a way to tell Windows that the underlying device is
512e (4K sector with 512 byte emulated interface)? This would keep
everything working as is but allow Windows to align writes to 4K
boundaries where possible.

James
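For anyone reproducing this, one way to check what block sizes a device
such as a bcache volume advertises is the BLKSSZGET/BLKPBSZGET ioctls; a
small standalone sketch follows, with the device path being only an
example:

/* Prints the logical and physical sector sizes the kernel reports for a
 * block device.  /dev/bcache0 is just an example path. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKSSZGET, BLKPBSZGET */

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/bcache0";
    int logical = 0;
    unsigned int physical = 0;
    int fd = open(dev, O_RDONLY);

    if (fd < 0) {
        perror(dev);
        return 1;
    }
    ioctl(fd, BLKSSZGET, &logical);    /* logical (addressable) sector size */
    ioctl(fd, BLKPBSZGET, &physical);  /* physical (hardware) sector size   */
    printf("%s: logical %d bytes, physical %u bytes\n", dev, logical, physical);
    close(fd);
    return 0;
}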
On Wed, Sep 05, 2012 at 11:56:08PM +0000, James Harper wrote:
> > On Mon, Aug 13, 2012 at 02:12:58PM +0000, James Harper wrote:
> > > I notice this code in drivers/block/xen-blkback/common.h
> > >
> > > #define vbd_sz(_v) ((_v)->bdev->bd_part ? \
> > >                     (_v)->bdev->bd_part->nr_sects : \
> > >                     get_capacity((_v)->bdev->bd_disk))
> > >
> > > Is the value returned by vbd_sz(_v) the number of sectors in the Linux
> > > device (eg size / 4096), or the number of 512 byte sectors? I suspect
> > > the former, which is causing block requests beyond 1/8th the size of
> > > the device to fail (assuming 4K sectors are expected to work at all -
> > > I can't quite get my head around how it would be expected to work -
> > > does Linux do the read-modify-write if required?)
> >
> > I think you need to instrument it to be sure.. But more interesting, do
> > you actually have a disk that exposes a 4KB hardware and logical sector?
> > So far I've only found SSDs that expose a 512B logical sector but also
> > expose the 4KB hardware sector.
> >
> > Never could figure out how that is all supposed to work, as blkback is
> > filled with << 9 on a bunch of things.
>
> I was using bcache, which does expose a 4K block size by default. I
> changed it to 512 and it all works now, although I haven't tested
> whether there is any loss of performance.

OK, let me see how I can set up bcache and play with that.

> Does Xen provide a way to tell Windows that the underlying device is
> 512e (4K sector with 512 byte emulated interface)? This would keep
> everything working as is but allow Windows to align writes to 4K
> boundaries where possible.

We can certainly expose that via the XenBus interface.

> James
On 6 September 2012 20:58, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Wed, Sep 05, 2012 at 11:56:08PM +0000, James Harper wrote:
>> > On Mon, Aug 13, 2012 at 02:12:58PM +0000, James Harper wrote:
>> > > I notice this code in drivers/block/xen-blkback/common.h
>> > >
>> > > #define vbd_sz(_v) ((_v)->bdev->bd_part ? \
>> > >                     (_v)->bdev->bd_part->nr_sects : \
>> > >                     get_capacity((_v)->bdev->bd_disk))
>> > >
>> > > Is the value returned by vbd_sz(_v) the number of sectors in the Linux
>> > > device (eg size / 4096), or the number of 512 byte sectors? I suspect
>> > > the former, which is causing block requests beyond 1/8th the size of
>> > > the device to fail (assuming 4K sectors are expected to work at all -
>> > > I can't quite get my head around how it would be expected to work -
>> > > does Linux do the read-modify-write if required?)
>> >
>> > I think you need to instrument it to be sure.. But more interesting, do
>> > you actually have a disk that exposes a 4KB hardware and logical sector?
>> > So far I've only found SSDs that expose a 512B logical sector but also
>> > expose the 4KB hardware sector.
>> >
>> > Never could figure out how that is all supposed to work, as blkback is
>> > filled with << 9 on a bunch of things.
>>
>> I was using bcache, which does expose a 4K block size by default. I
>> changed it to 512 and it all works now, although I haven't tested
>> whether there is any loss of performance.
>
> OK, let me see how I can set up bcache and play with that.
>
>> Does Xen provide a way to tell Windows that the underlying device is
>> 512e (4K sector with 512 byte emulated interface)? This would keep
>> everything working as is but allow Windows to align writes to 4K
>> boundaries where possible.
>
> We can certainly expose that via the XenBus interface.
>
>> James

After reading through blkback it appears that it can only support 512
byte sector sizes, and removing this limitation would take quite a bit
of work. It uses hard-coded bitshifts pervasively to convert between
number of requests/pages and size of sectors etc. (that is all the >> 9
everywhere).

I am going to see what I can do about getting it to support 4k sectors
too, and eventually uncoupled logical/physical sizes, but that would
take even more work as far as I can tell.

Being able to use 4k sectors seems like it would provide pretty massive
gains in performance just by being more efficient, let alone increasing
byte-aligned writes to the underlying block storage system.

Joseph.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846
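A minimal sketch of what replacing the hard-coded shift with a per-device
value could look like; the sector_shift field and the struct are assumed
for illustration and are not existing blkback code:

/* Hypothetical sketch only: blkback today hard-codes << 9 / >> 9; one way
 * to generalise would be a per-device shift derived from the real logical
 * sector size.  'sector_shift' is an assumed field, not existing code. */
#include <stdio.h>

struct vbd_example {
    unsigned int sector_size;   /* e.g. 512 or 4096        */
    unsigned int sector_shift;  /* 9 for 512, 12 for 4096  */
};

static unsigned long long sectors_to_bytes(const struct vbd_example *v,
                                           unsigned long long sectors)
{
    return sectors << v->sector_shift;  /* replaces the literal << 9 */
}

int main(void)
{
    struct vbd_example v512  = { 512, 9 };
    struct vbd_example v4096 = { 4096, 12 };

    printf("100 sectors = %llu bytes (512B) / %llu bytes (4K)\n",
           sectors_to_bytes(&v512, 100), sectors_to_bytes(&v4096, 100));
    return 0;
}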
On 16/09/2012 08:00, "Joseph Glanville" <joseph.glanville@orionvm.com.au> wrote:

> After reading through blkback it appears that it can only support 512
> byte sector sizes, and removing this limitation would take quite a bit
> of work. It uses hard-coded bitshifts pervasively to convert between
> number of requests/pages and size of sectors etc. (that is all the >> 9
> everywhere).
>
> I am going to see what I can do about getting it to support 4k sectors
> too, and eventually uncoupled logical/physical sizes, but that would
> take even more work as far as I can tell.
>
> Being able to use 4k sectors seems like it would provide pretty massive
> gains in performance just by being more efficient, let alone increasing
> byte-aligned writes to the underlying block storage system.

The PV blk transport may be based on 512-byte sectors, but the real
sector size is communicated between blkfront and blkback via xenbus
(field 'sector-size'), and blkfront is expected to only make requests
that are a multiple of, and aligned according to, that real
'sector-size'.

I would kind of expect it to work, as CD-ROMs have a larger sector size
(2kB IIRC) and we support those...

Bashing your head against the PV blk transport code may be premature. ;)

 -- Keir
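A sketch of the alignment rule described above, not actual blkfront code:
ring requests are expressed in 512-byte units, so with a xenbus
'sector-size' of S both the start and the length must be multiples of
S/512. The values below are only examples.

#include <stdio.h>
#include <stdbool.h>

static bool request_ok(unsigned long long start_512, unsigned long long len_512,
                       unsigned int sector_size)
{
    unsigned int step = sector_size >> 9;  /* 512-byte units per real sector */

    return (start_512 % step) == 0 && (len_512 % step) == 0;
}

int main(void)
{
    /* With sector-size = 4096 only 8-aligned, 8-multiple requests pass. */
    printf("start 16, len 8: %s\n", request_ok(16, 8, 4096) ? "ok" : "rejected");
    printf("start 17, len 8: %s\n", request_ok(17, 8, 4096) ? "ok" : "rejected");
    printf("start 16, len 3: %s\n", request_ok(16, 3, 4096) ? "ok" : "rejected");
    return 0;
}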
On 16 September 2012 18:31, Keir Fraser <keir.xen@gmail.com> wrote:
> On 16/09/2012 08:00, "Joseph Glanville" <joseph.glanville@orionvm.com.au>
> wrote:
>
>> After reading through blkback it appears that it can only support 512
>> byte sector sizes, and removing this limitation would take quite a bit
>> of work. It uses hard-coded bitshifts pervasively to convert between
>> number of requests/pages and size of sectors etc. (that is all the >> 9
>> everywhere).
>>
>> I am going to see what I can do about getting it to support 4k sectors
>> too, and eventually uncoupled logical/physical sizes, but that would
>> take even more work as far as I can tell.
>>
>> Being able to use 4k sectors seems like it would provide pretty massive
>> gains in performance just by being more efficient, let alone increasing
>> byte-aligned writes to the underlying block storage system.
>
> The PV blk transport may be based on 512-byte sectors, but the real
> sector size is communicated between blkfront and blkback via xenbus
> (field 'sector-size'), and blkfront is expected to only make requests
> that are a multiple of, and aligned according to, that real
> 'sector-size'.
>
> I would kind of expect it to work, as CD-ROMs have a larger sector size
> (2kB IIRC) and we support those...
>
> Bashing your head against the PV blk transport code may be premature. ;)
>
> -- Keir

Understood, still have a fair bit of reading to do. :)

Thanks,
Joseph.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846
> > Being able to use 4k sectors seems like it would provide pretty massive
> > gains in performance just by being more efficient, let alone increasing
> > byte-aligned writes to the underlying block storage system.
>
> The PV blk transport may be based on 512-byte sectors, but the real
> sector size is communicated between blkfront and blkback via xenbus
> (field 'sector-size'), and blkfront is expected to only make requests
> that are a multiple of, and aligned according to, that real
> 'sector-size'.
>
> I would kind of expect it to work, as CD-ROMs have a larger sector size
> (2kB IIRC) and we support those...
>
> Bashing your head against the PV blk transport code may be premature. ;)
>

So a sector-size of 4096 would basically be a 512e device, allowing the
underlying OS to communicate in 512 byte blocks but knowing that things
will work best in 4096 byte sized transfers aligned to multiples of 4096
bytes, right?

James
On 16/09/2012 11:37, "James Harper" <james.harper@bendigoit.com.au> wrote:

>>> Being able to use 4k sectors seems like it would provide pretty massive
>>> gains in performance just by being more efficient, let alone increasing
>>> byte-aligned writes to the underlying block storage system.
>>
>> The PV blk transport may be based on 512-byte sectors, but the real
>> sector size is communicated between blkfront and blkback via xenbus
>> (field 'sector-size'), and blkfront is expected to only make requests
>> that are a multiple of, and aligned according to, that real
>> 'sector-size'.
>>
>> I would kind of expect it to work, as CD-ROMs have a larger sector size
>> (2kB IIRC) and we support those...
>>
>> Bashing your head against the PV blk transport code may be premature. ;)
>
> So a sector-size of 4096 would basically be a 512e device, allowing the
> underlying OS to communicate in 512 byte blocks but knowing that things
> will work best in 4096 byte sized transfers aligned to multiples of 4096
> bytes, right?

My recollection is that blkfront is required to submit only
appropriately-sized and -aligned requests; i.e. it's not merely
advisory. I remember this got added for CD-ROM support, and if they had
worked without this, I'm sure we wouldn't have bothered!

 -- Keir

> James
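An illustration of what "not merely advisory" would imply for a guest,
not blkfront code: with a real sector size of 4096, a 512-byte write has
to become a read-modify-write of the containing 4K sector before it can
be submitted as an aligned request. The helper functions are stand-ins.

#include <stdio.h>
#include <string.h>

#define REAL_SECTOR 4096

/* Stand-ins for submitting aligned I/O to the backend; purely illustrative. */
static void read_real_sector(unsigned long long rs, unsigned char *buf)
{
    (void)rs;
    memset(buf, 0xAA, REAL_SECTOR);  /* pretend we fetched the 4K sector */
}

static void write_real_sector(unsigned long long rs, const unsigned char *buf)
{
    (void)rs;
    (void)buf;                       /* pretend we wrote the 4K sector back */
}

/* Write 512 bytes at a 512-byte-granular offset on a 4K-sector device. */
static void write_512(unsigned long long sector_512, const unsigned char *data)
{
    unsigned char buf[REAL_SECTOR];
    unsigned long long rs = sector_512 / (REAL_SECTOR / 512);  /* containing 4K sector */
    size_t off = (sector_512 % (REAL_SECTOR / 512)) * 512;     /* offset inside it     */

    read_real_sector(rs, buf);     /* read   */
    memcpy(buf + off, data, 512);  /* modify */
    write_real_sector(rs, buf);    /* write  */
}

int main(void)
{
    unsigned char data[512] = { 0 };

    write_512(17, data);  /* lands inside 4K sector 2, bytes 512..1023 */
    printf("done\n");
    return 0;
}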
> > So a sector-size of 4096 would basically be a 512e device, allowing
> > the underlying OS to communicate in 512 byte blocks but knowing that
> > things will work best in 4096 byte sized transfers aligned to
> > multiples of 4096 bytes, right?
>
> My recollection is that blkfront is required to submit only
> appropriately-sized and -aligned requests; i.e. it's not merely
> advisory. I remember this got added for CD-ROM support, and if they had
> worked without this, I'm sure we wouldn't have bothered!

That's a shame. It would be good to have separate values for physical
and logical block sizes so the guest VM can make appropriate alignment
decisions. In fact, there is a lot of stuff in /sys for the block
devices that it would be nice to have mapped into xenstore!

James
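A rough sketch of what this suggestion could look like in the style of
blkback's xenbus connect path; 'physical-sector-size' is an illustrative
key name rather than something this thread establishes, and error
handling is elided:

/* Sketch only: write both the logical and physical block sizes of the
 * backing device into xenstore next to the existing 'sector-size' node.
 * 'physical-sector-size' is an assumed key name for illustration. */
#include <linux/blkdev.h>
#include <xen/xenbus.h>

static void connect_sizes_sketch(struct xenbus_device *dev,
                                 struct xenbus_transaction xbt,
                                 struct block_device *bdev)
{
    /* Existing behaviour: the logical sector size, as 'sector-size'. */
    xenbus_printf(xbt, dev->nodename, "sector-size", "%u",
                  (unsigned int)bdev_logical_block_size(bdev));

    /* Suggested addition: the physical (hardware) sector size, so the
     * frontend can align I/O to it without being forced to. */
    xenbus_printf(xbt, dev->nodename, "physical-sector-size", "%u",
                  (unsigned int)bdev_physical_block_size(bdev));
}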
> I would kind of expect it to work, as CD-ROMs have a larger sector size
> (2kB IIRC) and we support those...

For data blocks they are 2K, as are some magneto-opticals.

The more complicated case is modern hard disks: while you can access
them on 512 byte boundaries they are actually using bigger block sizes,
but the large blocks are not necessarily on the 0 boundary, in order to
get optimal alignment for existing file systems and partitioning.

So knowing the block size isn't the whole story.

Alan
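A worked example of the point about the big blocks not starting at the 0
boundary, with made-up numbers: some 512e drives shift their internal 4K
boundaries so that a partition starting at the legacy DOS default of LBA
63 still lands on a physical boundary, which means LBA alignment alone
does not tell you whether I/O is aligned on the medium.

/* Illustrative arithmetic only: with an internal offset, "4K-aligned" in
 * logical terms may or may not be 4K-aligned on the platters. */
#include <stdio.h>

int main(void)
{
    unsigned long long align_off = 3584;  /* assumed offset in bytes (7 * 512) */
    unsigned long long lbas[] = { 0, 8, 63, 64, 2048 };
    size_t i;

    for (i = 0; i < sizeof(lbas) / sizeof(lbas[0]); i++) {
        unsigned long long byte = lbas[i] * 512;
        int aligned = (byte % 4096) == align_off;  /* boundary shifted by align_off */

        printf("LBA %4llu: %s a physical 4K boundary\n",
               lbas[i], aligned ? "starts on" : "does not start on");
    }
    return 0;
}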
> > I would kind of expect it to work, as CD-ROMs have a larger sector
> > size (2kB IIRC) and we support those...
>
> For data blocks they are 2K, as are some magneto-opticals.
>
> The more complicated case is modern hard disks: while you can access
> them on 512 byte boundaries they are actually using bigger block sizes,
> but the large blocks are not necessarily on the 0 boundary, in order to
> get optimal alignment for existing file systems and partitioning.
>
> So knowing the block size isn't the whole story.

Are you saying that Xen and/or Linux needs to worry about a user setting
up a poorly aligned filesystem to pass to a VM? Seems simpler just to set
things up right in the first place. Or did you mean something else?

James
> > So knowing the block size isn't the whole story.
>
> Are you saying that Xen and/or Linux needs to worry about a user setting
> up a poorly aligned filesystem to pass to a VM? Seems simpler just to set
> things up right in the first place.

That assumes things like a file system and the existing layout being
correct. Plus you also have to set the thing up, which means you have to
know about such stuff.

For file systems Linux itself does indeed take the approach of "so
partition sensibly", because in the fs case it's really hard, if not
impossible, to do a good job any other way. For raw devices and things
like databases wanting atomicity of block writes, however, it's quite
different and you need to be aware of the alignments.

Alan