On Wed, Mar 20, 2019 at 12:11:57PM -0500, Eric Blake wrote:
>On 3/20/19 11:57 AM, Richard W.M. Jones wrote:
>
>>>> Also an observation: qemu's nbd client only ever issues block status
>>>> requests with the req-one flag set, so perhaps we should optimize for
>>>> that case.
>>>
>>> I hope to get to the point where future qemu doesn't send the req-one
>>> flag.  There are several threads on the qemu list about how qemu-img is
>>> slower than it needs to be because it is throwing away useful
>>> information, and where it is aggravated by the kernel's abysmal lseek()
>>> performance on tmpfs.  But until qemu learns useful caching, you're
>>> right that most existing NBD clients that request block status do so one
>>> extent at a time (because I don't know of any other existing NBD clients
>>> that use BLOCK_STATUS yet).
>>
>> Is it ever possible to cache block status results?  What happens in
>> the (admittedly unusual) case where two writers are hitting the same
>> NBD server?  For example if the server is implementing a cluster
>> filesystem.
>
>For a read-only client: caching 'data' regions is okay, caching 'zero'
>or 'hole' regions is bad (because even though you are not modifying the
>image, another writer might be; demoting 'hole' to 'data' is safe - it
>merely pessimizes into a read() instead of skipping work; but caching
>'hole' that is later promoted to 'data' is wrong - it can cause data
>loss if the client doesn't read the actual data).
>
>For a writing client: either you are an exclusive writer (and should
>know what you wrote, so the cache is the fact that you changed the state
>yourself) or you are on a cluster filesystem (at which point, your
>cluster system had better have its own rules for how to resolve races
>inherent in multiple writers, where you shouldn't be relying on block
>status but on the cluster protocol in the first place).

Even for the reading client, you don't need to access a place on the
disk twice for things to go wrong; even a single access is racy,
because there can be a change between BLOCK_STATUS and READ.  And the
same thing happens in the plugins, for files and for anything else
that someone else can access.  I don't think it is designed for
concurrent access.  Or is it?

>--
>Eric Blake, Principal Software Engineer
>Red Hat, Inc.           +1-919-301-3226
>Virtualization:  qemu.org | libvirt.org
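[To make the caching rule above concrete, here is a minimal C sketch of
a client-side policy that remembers only 'data' results.  The enum, the
cache structure, and cache_store() are hypothetical illustrations, not
part of any real NBD client library.]

    #include <stdint.h>

    enum extent_status { STATUS_UNKNOWN = 0, STATUS_DATA, STATUS_HOLE };

    struct status_cache_entry {
        uint64_t offset;
        uint64_t length;
        enum extent_status status;
    };

    /* Record a BLOCK_STATUS reply in a client-side cache, following the
     * rule above: only 'data' is safe to remember. */
    void
    cache_store (struct status_cache_entry *slot,
                 uint64_t offset, uint64_t length,
                 enum extent_status status)
    {
        if (status == STATUS_DATA) {
            /* Safe: a stale 'data' entry merely pessimizes a later
             * skip into a read() of zeroes. */
            slot->offset = offset;
            slot->length = length;
            slot->status = STATUS_DATA;
        }
        else {
            /* Unsafe to remember: a concurrent writer can promote a
             * hole to data at any time, and acting on a stale 'hole'
             * means skipping a read and losing the new data.  Demote
             * to "unknown" so the next access re-queries or just
             * reads. */
            slot->status = STATUS_UNKNOWN;
        }
    }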
On 3/21/19 7:18 AM, Martin Kletzander wrote:
>> For a read-only client: caching 'data' regions is okay, caching 'zero'
>> or 'hole' regions is bad (because even though you are not modifying the
>> image, another writer might be; demoting 'hole' to 'data' is safe - it
>> merely pessimizes into a read() instead of skipping work; but caching
>> 'hole' that is later promoted to 'data' is wrong - it can cause data
>> loss if the client doesn't read the actual data).
>>
>> For a writing client: either you are an exclusive writer (and should
>> know what you wrote, so the cache is the fact that you changed the state
>> yourself) or you are on a cluster filesystem (at which point, your
>> cluster system had better have its own rules for how to resolve races
>> inherent in multiple writers, where you shouldn't be relying on block
>> status but on the cluster protocol in the first place).
>
> Even for the reading client, you don't need to access a place on the
> disk twice for things to go wrong; even a single access is racy,
> because there can be a change between BLOCK_STATUS and READ.  And the
> same thing happens in the plugins, for files and for anything else
> that someone else can access.  I don't think it is designed for
> concurrent access.  Or is it?

Indeed, there is always a TOCTTOU race when you rely on block status if
there is ever a concurrent writer.  But is it dangerous?  Without block
status, we can have either:

reader                         writer
-------------------------------------------------
read sector X as A
                               write sector X as B

or:

reader                         writer
-------------------------------------------------
                               write sector X as B
read sector X as B

With the two steps, we have one of:

reader                         writer
-------------------------------------------------
learn sector X has status Y
read sector X as A
                               write sector X as B

reader                         writer
-------------------------------------------------
learn sector X has status Y
                               write sector X as B
read sector X as B

reader                         writer
-------------------------------------------------
                               write sector X as B
learn sector X has status Y
read sector X as B

where the shortcut is that if the reader sees status 'hole', it skips a
read.  Had it done the read in spite of learning about a hole, it would
either see all 0s (contents A - but that's no different than the read
winning the race without a status check), or the new content (contents B
just written by the writer - proving the hole status is out of date, but
no different from losing the race without a status check).

In general, trying to copy an image while it is being modified is not
going to work reliably; the main point of block status is to make
copying more efficient, but when copying, you are assuming no concurrent
modifications.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
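[The shortcut described above can be shown as a small C sketch.
nbd_query_status() and nbd_read() are hypothetical placeholders for a
real NBD client's calls, not an actual API; the comment marks where the
TOCTTOU window sits, and why losing that race is no worse than losing a
race around a plain read.]

    #include <stdint.h>
    #include <string.h>

    enum extent_status { STATUS_DATA, STATUS_HOLE };

    /* Hypothetical client calls -- stand-ins for whatever the real
     * client uses to issue NBD_CMD_BLOCK_STATUS and NBD_CMD_READ. */
    extern enum extent_status nbd_query_status (uint64_t offset,
                                                uint64_t len);
    extern int nbd_read (void *buf, uint64_t offset, uint64_t len);

    /* Copy one sector into 'buf', skipping the read when block status
     * reports a hole. */
    int
    copy_sector (void *buf, uint64_t offset, uint64_t sector_size)
    {
        if (nbd_query_status (offset, sector_size) == STATUS_HOLE) {
            /* TOCTTOU window is here: a write landing between the
             * status query and this memset means we copy stale zeroes
             * (contents A) - indistinguishable from a plain read that
             * won the same race. */
            memset (buf, 0, sector_size);
            return 0;
        }
        return nbd_read (buf, offset, sector_size);
    }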
On Thu, Mar 21, 2019 at 09:01:26AM -0500, Eric Blake wrote:
>On 3/21/19 7:18 AM, Martin Kletzander wrote:
>
>>> For a read-only client: caching 'data' regions is okay, caching 'zero'
>>> or 'hole' regions is bad (because even though you are not modifying the
>>> image, another writer might be; demoting 'hole' to 'data' is safe - it
>>> merely pessimizes into a read() instead of skipping work; but caching
>>> 'hole' that is later promoted to 'data' is wrong - it can cause data
>>> loss if the client doesn't read the actual data).
>>>
>>> For a writing client: either you are an exclusive writer (and should
>>> know what you wrote, so the cache is the fact that you changed the state
>>> yourself) or you are on a cluster filesystem (at which point, your
>>> cluster system had better have its own rules for how to resolve races
>>> inherent in multiple writers, where you shouldn't be relying on block
>>> status but on the cluster protocol in the first place).
>>>
>>
>> Even for the reading client, you don't need to access a place on the
>> disk twice for things to go wrong; even a single access is racy,
>> because there can be a change between BLOCK_STATUS and READ.  And the
>> same thing happens in the plugins, for files and for anything else
>> that someone else can access.  I don't think it is designed for
>> concurrent access.  Or is it?
>
>Indeed, there is always a TOCTTOU race when you rely on block status if
>there is ever a concurrent writer.  But is it dangerous?  Without block
>status, we can have either:
>
>reader                         writer
>-------------------------------------------------
>read sector X as A
>                               write sector X as B
>
>or:
>
>reader                         writer
>-------------------------------------------------
>                               write sector X as B
>read sector X as B
>
>With the two steps, we have one of:
>
>reader                         writer
>-------------------------------------------------
>learn sector X has status Y
>read sector X as A
>                               write sector X as B
>
>reader                         writer
>-------------------------------------------------
>learn sector X has status Y
>                               write sector X as B
>read sector X as B
>
>reader                         writer
>-------------------------------------------------
>                               write sector X as B
>learn sector X has status Y
>read sector X as B
>
>where the shortcut is that if the reader sees status 'hole', it skips a
>read.  Had it done the read in spite of learning about a hole, it would
>either see all 0s (contents A - but that's no different than the read
>winning the race without a status check), or the new content (contents B
>just written by the writer - proving the hole status is out of date, but
>no different from losing the race without a status check).
>
>In general, trying to copy an image while it is being modified is not
>going to work reliably; the main point of block status is to make
>copying more efficient, but when copying, you are assuming no concurrent
>modifications.

Yes, basically it should not be used for communication.  So we are on
the same page; I just wanted to confirm that.

>--
>Eric Blake, Principal Software Engineer
>Red Hat, Inc.           +1-919-301-3226
>Virtualization:  qemu.org | libvirt.org