Steve Dobbelstein
2006-Feb-02 21:34 UTC
[Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
While running some disk performance tests for VMX domains we noticed that writes to the backend device for a VMX domain's disk go through the buffer cache, that is, they are not written immediately to disk. Shouldn't the I/Os go straight to the backend device, i.e., shouldn't the device be opened with O_DIRECT or some such? From the domain's perspective it expects the data to be physically on the device, but in reality it is not. There are things, such as writes to a file system journal, that the OS in the domain will expect to be on disk. If the whole system crashes before the buffer cache in dom0 is written to disk, those writes may not be on the disk. When the domain is started again it may find the file system in an inconsistent state, due to writes to the journal that didn't make it to disk, and may not be able to recover.

It seems to me that if a domain expects things to be physically on its frontend device, they should be physically on the backend device as well. Or am I missing something from the bigger picture?

Steve D.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
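[Editor's note: a minimal sketch of the behavior Steve is asking for. This is not qemu-dm's actual code; the path and variable names are illustrative. It opens a backend file with O_SYNC so that a completed write is on stable storage before the emulated disk could report completion to the guest. O_DIRECT could be OR'ed into the flags as well, but it requires sector-aligned buffers and is not supported on every filesystem, so this portable demo uses O_SYNC only.]

```python
import os
import tempfile

# Hypothetical backend image path (illustrative, not from the thread).
path = os.path.join(tempfile.mkdtemp(), "backend.img")

# O_SYNC: os.write does not return until the data (and the metadata
# needed to retrieve it) has reached the device, so nothing the guest
# believes is committed can be stranded in dom0's buffer cache.
flags = os.O_RDWR | os.O_CREAT | os.O_SYNC
fd = os.open(path, flags, 0o600)
try:
    os.write(fd, b"journal block 0")
finally:
    os.close(fd)

with open(path, "rb") as f:
    print(f.read())  # b'journal block 0'
```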
Anthony Liguori
2006-Feb-02 21:46 UTC
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
Steve Dobbelstein wrote:

> While running some disk performance tests for VMX domains we noticed that
> writes to the backend device for a VMX domain's disk go through the buffer
> cache, that is, they are not written immediately to disk. Shouldn't the
> I/Os go straight to the backend device, i.e., the device should be opened
> with O_DIRECT or some such? [...]
>
> It seems to me that if a domain expects things to be physically on its
> frontend device that they should be physically on the backend device as
> well. Or am I missing something from the bigger picture?

I would doubt it. Since it's usually opening a file, and qemu-dm is emulating a contiguous disk, you probably want the buffer cache to reorder events.

Are you seeing a performance improvement? Should be easy to check.

Regards,

Anthony Liguori
Steve Dobbelstein
2006-Feb-02 22:28 UTC
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
aliguori@us.ltcfwd.linux.ibm.com wrote on 02/02/2006 03:46:11 PM:

> I would doubt it. Since it's usually opening a file, and qemu-dm is
> emulating a contiguous disk, you probably want the buffer cache to
> reorder events.

I guess we're not usual since our backend is an LVM volume. :)

I can appreciate how writing to the buffer cache can speed up the response to the I/O and make its writing to the backend device more efficient by reordering events. However, I'm still wondering whether we have a data corruption issue should dom0 crash before it writes the data in the buffer cache to disk: data that the domain expects to be on the disk but that won't be there when the domain is restarted.

> Are you seeing a performance improvement? Should be easy to check.

We just started doing the first runs of disk performance tests when we noticed this behavior and thought we should bring it up on the list. We don't have enough data points to compare yet. We'll post problems/issues if/when we find them.

Steve D.
Philip R. Auld
2006-Feb-02 22:41 UTC
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
Hi,

Rumor has it that on Thu, Feb 02, 2006 at 04:28:37PM -0600 Steve Dobbelstein said:

> I can appreciate how writing to the buffer cache can speed up the response
> to the I/O and make it more efficient in its writing to the backend device
> by reordering events. However, I'm still wondering if we have a data
> corruption issue should dom0 crash before it writes the data in the buffer
> cache to disk, data that the domain expects to be on the disk but won't be
> there when the domain is restarted.

I agree. It sounds like a correctness problem. It's just like disks with write caching enabled.

> > Are you seeing a performance improvement? Should be easy to check.

It's more about correctness and data integrity than performance.

Cheers,

Phil

--
Philip R. Auld, Ph.D.        Egenera, Inc.
Software Architect           165 Forest St.
(508) 858-2628               Marlboro, MA 01752
Anthony Liguori
2006-Feb-03 00:09 UTC
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
Philip R. Auld wrote:

> Rumor has it that on Thu, Feb 02, 2006 at 04:28:37PM -0600 Steve Dobbelstein said:
>
> > I can appreciate how writing to the buffer cache can speed up the response
> > to the I/O and make it more efficient in its writing to the backend device
> > by reordering events. However, I'm still wondering if we have a data
> > corruption issue should dom0 crash before it writes the data in the buffer
> > cache to disk, data that the domain expects to be on the disk but won't be
> > there when the domain is restarted.
>
> I agree. It sounds like a correctness problem. It's just like disks
> with write caching enabled.

Referring to the original question, which has been quoted away: journaling doesn't require that data be written to disk per se, but that writes occur in a particular order. A journal is always recoverable given that writes occur in the expected order. A buffer cache will have no effect on that order, so you're no more likely to have corruption than if you disabled the buffer cache.

You especially want the buffer cache if you have LVM partitions. Sectors on an LVM disk are not necessarily contiguous and can even span multiple disks. You definitely want the IO scheduler involved there.

If anything, what you really want (from a performance perspective) is to disable the buffer cache in the domU and leave it enabled in the dom0 (this is what the paravirtual drivers should be doing, IIRC).

Does this address your corruption concerns?

Regards,

Anthony Liguori

> It's more about correctness and data integrity than performance.
>
> Cheers,
>
> Phil
Luciano Miguel Ferreira Rocha
2006-Feb-03 00:31 UTC
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
On Thu, Feb 02, 2006 at 06:09:11PM -0600, Anthony Liguori wrote:

> Referring to the original question, which has been quoted away,
> journaling doesn't require that data be written to disk per se but that
> writes occur in a particular order. A journal is always recoverable
> given that writes occur in the expected order. A buffer cache will have
> no effect on that order so you're no more likely to have corruption than
> if you disabled the buffer cache.

Corruption meaning that the domU thinks data has been committed to disk but it never was (dom0 crashed before the cache could be flushed). The correctness of some protocols or procedures depends on being able to forcefully commit changes to disk (databases, for example).

> If anything, what you really want (from a performance perspective) is to
> disable the buffer cache in the domU and leave it enabled in the dom0
> (this is what the paravirtual drivers should be doing IIRC).

I disagree. domU must be able to sync(). And if domUs are already caching data, why let them pollute dom0's cache?

--
lfr
0/0
Rik van Riel
2006-Feb-03 02:40 UTC
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
On Thu, 2 Feb 2006, Anthony Liguori wrote:

> Referring to the original question, which has been quoted away,
> journaling doesn't require that data be written to disk per se but that
> writes occur in a particular order. A journal is always recoverable
> given that writes occur in the expected order.

If I do a database transaction or accept an email (SMTP transaction), I need to ensure that the data really did make it to disk. There is a reason that the email RFCs explicitly state that the email has to be committed to stable storage before returning the "250 Ok"!

--
All Rights Reversed
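[Editor's note: a hedged sketch of the rule Rik describes: an MTA must not acknowledge a message until it is on stable storage. The names here (accept_message, SPOOL_DIR) are illustrative, not taken from any real MTA.]

```python
import os
import tempfile

# Illustrative spool directory (hypothetical, not from a real MTA).
SPOOL_DIR = tempfile.mkdtemp()

def accept_message(msg_id, body):
    """Durably store a message, then (and only then) acknowledge it."""
    path = os.path.join(SPOOL_DIR, msg_id)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, body)
        os.fsync(fd)          # force the data out of the page cache
    finally:
        os.close(fd)
    # Also fsync the directory so the new directory entry is durable.
    dfd = os.open(SPOOL_DIR, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
    return "250 Ok"           # safe to acknowledge only after the fsyncs

print(accept_message("msg1", b"Subject: hi\r\n\r\nhello\r\n"))  # 250 Ok
```

If dom0's buffer cache silently absorbs the write behind the guest's back, the fsync() inside the guest no longer delivers the guarantee this code depends on, which is exactly the concern raised in the thread.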
Stephen Tweedie
2006-Feb-03 02:42 UTC
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
Hi,

On Thu, 2006-02-02 at 18:09 -0600, Anthony Liguori wrote:

> Referring to the original question, which has been quoted away,
> journaling doesn't require that data be written to disk per se but that
> writes occur in a particular order. A journal is always recoverable
> given that writes occur in the expected order.

Sure... it's *internally* consistent, maybe. But you need more than that. You need guarantees that things are on disk, else external consistency guarantees will be broken.

Consider things like sendmail fsync()ing a spool file before telling the sender that the email has been accepted. After that acknowledgement, the sender can delete the mail from its queues knowing that the recipient MTA definitely has the data, and even if it crashes, the mail won't be lost. Databases frequently have similar consistency requirements. If a power failure loses writes that you have told the domU have completed --- even if you maintain write ordering --- then you *are* putting application correctness at risk; there's no doubt about it.

> A buffer cache will have
> no effect on that order so you're no more likely to have corruption than
> if you disabled the buffer cache.

Not if it's being used as a write-through cache. If it's write-back, it will have a major impact on ordering.

> You especially want the buffer cache if you have LVM partitions.
> Sectors on an LVM disk are not necessarily contiguous and can even span
> multiple disks. You definitely want the IO scheduler involved there.

That does not at all imply the use of the buffer cache. All that you need to satisfy this is AIO (asynchronous *submission* of the IO) combined with O_DIRECT IO (synchronous *completion*) --- i.e., you can submit multiple IOs concurrently, but you know for sure when each one completes. That still lets the elevator get strongly involved in the scheduling and reordering of the IOs, but lets you know reliably when things hit disk.
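[Editor's note: a user-space sketch of the model Stephen describes: many writes submitted asynchronously, so the elevator can still reorder them, but each completion being synchronous, so a finished write is known to be on disk. A thread pool stands in for AIO and O_SYNC stands in for O_DIRECT completion semantics here; blkback itself does this in the kernel with submit_bio and bi_end_io, not with this code.]

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, wait

path = os.path.join(tempfile.mkdtemp(), "disk.img")
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_SYNC, 0o600)

def submit(offset, data):
    # pwrite on an O_SYNC fd returns only once the data is on stable
    # storage, so a resolved future means the IO is hard on disk.
    return os.pwrite(fd, data, offset)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Eight 512-byte "sector" writes submitted concurrently; the kernel
    # IO scheduler is free to reorder them on the way down.
    futures = [pool.submit(submit, i * 512, bytes([65 + i]) * 512)
               for i in range(8)]
    wait(futures)  # every write is now known to be on disk

os.close(fd)
print(os.path.getsize(path))  # 4096
```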
Fortunately, that's just what blkback is doing --- it's using submit_bio to submit the write IOs without waiting for completion, and is using the bio's bi_end_io callback to process the IO completion once it is hard on disk.

--Stephen
Anthony Liguori
2006-Feb-03 02:50 UTC
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
Stephen Tweedie wrote:

> Sure... it's *internally* consistent, maybe. But you need more than
> that. You need guarantees that things are on disk, else external
> consistency guarantees will be broken.

Ok, this is certainly correct (but not the original point).

> Consider things like sendmail fsync()ing a spool file before telling the
> sender that the email has been accepted. [...] If a power failure loses
> writes that you have told the domU have completed --- even if you
> maintain write ordering --- then you *are* putting application
> correctness at risk, there's no doubt about it.

Ok, this is a good argument for using O_SYNC.

> Fortunately, that's just what blkback is doing --- it's using submit_bio
> to submit the write IOs without waiting for completion, and is using the
> bio's bi_end_io callback to process the IO completion once it is hard on
> disk.

Yup. The question here is with the device model, which doesn't use the block frontend/backend. Would O_DIRECT be helpful over O_SYNC?

Regards,

Anthony Liguori
Stephen C. Tweedie
2006-Feb-03 15:42 UTC
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
Hi,

On Thu, 2006-02-02 at 20:50 -0600, Anthony Liguori wrote:

> Yup. The question here is with the device model, which doesn't use the
> block frontend/backend. Would O_DIRECT be helpful over O_SYNC?

There are really two separate parts to that.

First is whether write-through caching is helpful. My own gut reaction is not --- it implies an extra copy in the dom0, which is both a CPU overhead to make the copy and a memory overhead to preserve it. It does mean that subsequent reads will be faster, but it should be up to the domU to decide whether that is useful or not, not the dom0 --- a domU running a database using O_DIRECT **really** doesn't want dom0 doing any extra caching on its writes.

Second is whether O_DIRECT fits the IO model. O_DIRECT has a lot of extra constraints on the sort of IO you can do --- it has to be sector-aligned in memory, in size and on disk, for example. That will probably fit neatly into the environment that a block device backend is running in, but something doing filesystem-level IO forwarding won't always have the same alignment guarantees. (The page cache gives us the right alignment in a lot of cases, though.)

--Stephen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
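[Editor's note: a sketch of the alignment bookkeeping Stephen mentions. O_DIRECT requires the buffer address, the transfer size, and the file offset to all be multiples of the sector size (512 bytes historically; 4096 is the safe modern choice). mmap returns page-aligned memory, which satisfies the address requirement. The O_DIRECT open itself is shown only as a comment with a hypothetical device path, since not every filesystem supports the flag.]

```python
import ctypes
import mmap

ALIGN = 4096  # safe alignment for O_DIRECT on modern disks

# Anonymous mmap gives page-aligned, writable memory.
buf = mmap.mmap(-1, ALIGN)
buf[:] = b"\x5a" * ALIGN      # fill exactly one aligned block

# Verify the address really is aligned, as O_DIRECT demands.
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
print(addr % ALIGN == 0)      # True

# What the O_DIRECT transfer would look like on Linux (hypothetical
# device path; requires filesystem/driver support):
#   fd = os.open("/dev/vg0/domu-disk", os.O_RDWR | os.O_DIRECT)
#   os.pwrite(fd, buf, 0)     # offset 0 is ALIGN-aligned too
```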