Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort?

--
Regards,
Jeremy
Jeremy Teo wrote:
> Would it be worthwhile to implement heuristics to auto-tune
> 'recordsize', or would that not be worth the effort?

It would be really great to automatically select the proper recordsize for each file! How do you suggest doing so?

--matt
One technique would be to keep a histogram of read & write sizes. Presumably one would want to do this only during a "tuning phase" after the file was first created, or when access patterns change. (A shift to smaller record sizes can be detected by a large proportion of write operations which require block pre-reads; a shift to larger record sizes can be detected by a large proportion of write operations which write more than one block.) The ability to change the block size on-the-fly seems useful here.
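A minimal sketch of the histogram heuristic described above, assuming hypothetical per-file counters kept only during a tuning phase; none of these names, thresholds, or structures exist in ZFS, they are invented purely for illustration:

/*
 * Hypothetical sketch of the histogram heuristic described above.
 * The bucketing, the sample threshold, and all names are assumptions.
 */
#include <stdint.h>

#define MIN_SHIFT   9                   /* 512-byte bucket */
#define MAX_SHIFT   17                  /* 128K bucket */
#define NBUCKETS    (MAX_SHIFT - MIN_SHIFT + 1)

typedef struct io_histogram {
    uint64_t buckets[NBUCKETS];         /* counts per power-of-two size */
    uint64_t total;                     /* total I/Os observed */
} io_histogram_t;

/* Called from the read/write path while the file is in its tuning phase. */
static void
hist_record(io_histogram_t *h, uint64_t iosize)
{
    int shift = MIN_SHIFT;

    /* Round the I/O size up to the next power of two and clamp. */
    while (shift < MAX_SHIFT && ((uint64_t)1 << shift) < iosize)
        shift++;
    h->buckets[shift - MIN_SHIFT]++;
    h->total++;
}

/* Pick the most common I/O size once enough samples have accumulated. */
static uint64_t
hist_pick_recordsize(const io_histogram_t *h, uint64_t current_recsize)
{
    uint64_t best = 0;
    int best_i = -1;

    if (h->total < 64)                  /* arbitrary "tuning phase" cutoff */
        return (current_recsize);

    for (int i = 0; i < NBUCKETS; i++) {
        if (h->buckets[i] > best) {
            best = h->buckets[i];
            best_i = i;
        }
    }
    return (best_i < 0 ? current_recsize :
        (uint64_t)1 << (best_i + MIN_SHIFT));
}

Bucketing by power of two keeps the table small; the sample threshold stands in for the "tuning phase" cutoff mentioned above.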
On Fri, Oct 13, 2006 at 08:30:27AM -0700, Matthew Ahrens wrote:
> Jeremy Teo wrote:
> > Would it be worthwhile to implement heuristics to auto-tune
> > 'recordsize', or would that not be worth the effort?
>
> It would be really great to automatically select the proper recordsize
> for each file! How do you suggest doing so?

I would suggest the following:

 - on file creation start with record size = 8KB (or some such smallish
   size), but don't record this on-disk yet

 - keep the record size at 8KB until the file exceeds some size, say,
   .5MB, at which point the most common read size, if there were enough
   reads, or the most common write size otherwise, should be used to
   derive the actual file record size (rounding up if need be)

 - if the selected record size != 8KB then re-write the file with the
   new record size

 - record the file's selected record size in an extended attribute

 - on truncation keep the existing file record size

 - on open of non-empty files without an associated file record size,
   stick to the original approach (growing the file block size up to the
   FS record size, defaulting to 128KB)

I think we should create a namespace for Solaris-specific extended attributes.

The file record size attribute should be writable, but changes in record size should only be allowed when the file is empty or when the file data is in one block. E.g., writing "8KB" to a file's RS EA when the file is larger than 8KB or consists of more than one block should appear to succeed, but a subsequent read of the RS EA should show the previous record size.

This approach might lead to the creation of new tunables for controlling the heuristic (e.g., which heuristic, initial RS, file size at which RS will be determined, default RS when none can be determined).

Nico
--
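A compact sketch of the decision step in the proposal above. Every name, threshold, and helper here (the tuning_state fields, set_file_recordsize_ea(), rewrite_file()) is hypothetical and stands in for the extended-attribute write and file-rewrite steps; it is not real ZPL code.

#include <stdint.h>
#include <stdbool.h>

#define INITIAL_RS      (8 * 1024)      /* provisional record size */
#define DECISION_SIZE   (512 * 1024)    /* decide once the file passes .5MB */
#define MAX_RS          (128 * 1024)    /* current FS maximum */

typedef struct tuning_state {
    uint64_t file_size;
    uint64_t common_read_sz;    /* most common read size seen so far */
    uint64_t nreads;
    uint64_t common_write_sz;   /* most common write size seen so far */
    bool     decided;
} tuning_state_t;

/* Hypothetical helpers: persist the EA and rewrite the data blocks. */
extern void set_file_recordsize_ea(void *file, uint64_t rs);
extern void rewrite_file(void *file, uint64_t rs);

static uint64_t
roundup_pow2(uint64_t x)
{
    uint64_t r = 512;
    while (r < x)
        r <<= 1;
    return (r);
}

static void
maybe_decide_recordsize(void *file, tuning_state_t *ts)
{
    if (ts->decided || ts->file_size <= DECISION_SIZE)
        return;

    /* Prefer the most common read size if we have seen enough reads. */
    uint64_t rs = (ts->nreads >= 16 && ts->common_read_sz != 0) ?
        ts->common_read_sz : ts->common_write_sz;
    rs = roundup_pow2(rs != 0 ? rs : INITIAL_RS);
    if (rs > MAX_RS)
        rs = MAX_RS;

    if (rs != INITIAL_RS)
        rewrite_file(file, rs);         /* re-write with the new record size */
    set_file_recordsize_ea(file, rs);   /* record the choice in an EA */
    ts->decided = true;
}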
Group,

I am not sure I agree with the 8k size. Since "recordsize" is based on the size of filesystem blocks for large files, my first consideration is what the maximum size of the file object will be. For extremely large files (25 to 100GBs) that are accessed sequentially for both read & write, I would expect 64k or 128k.

Putpage functions attempt to grab a number of pages off the vnode and place their modified contents within disk blocks. Thus if disk blocks are larger, fewer of them are needed, which can result in more efficient operations. However, any small change to the filesystem block results in the entire filesystem block being accessed, so small accesses to the block are very inefficient. Lastly, access to a larger block will occupy the media for a longer period of continuous time, possibly creating a larger latency than necessary for another unrelated op.

Hope this helps...

Mitchell Erblich
-------------------

Nicolas Williams wrote:
> I would suggest the following:
>
> - on file creation start with record size = 8KB (or some such smallish
>   size), but don't record this on-disk yet
> [...]
On Fri, Oct 13, 2006 at 09:22:53PM -0700, Erblichs wrote:
> For extremely large files (25 to 100GBs), that are accessed
> sequentially for both read & write, I would expect 64k or 128k.

Large files accessed sequentially don't need any special heuristic for record size determination: just use the filesystem's record size and be done. The bigger the record size, the better -- a form of read ahead.

Nico
--
Nico,

Yes, I agree. But single random large reads and writes would also benefit from a large record size, so I didn't try to make that distinction. However, I "guess" that the best large random reads & writes would fall within single filesystem record sizes.

No, I haven't reviewed whether the holes (disk block space) tend to be multiples of record size, page size, or... Would a write of recordsize that didn't fall on a record size boundary write into 2 filesystem blocks / records?

However, would extremely large record sizes, say 1MB (or more) (what is the limit?), open up write atomicity or file corruption issues? Would record sizes like these be equal to multiple track writes?

Also, because of the "disk block" allocation strategy, I wasn't too sure that any form of multiple disk block contiguity still applies with ZFS at smaller record sizes. Yes, to minimize seek and rotational latencies and help with read ahead and "write behind"...

Oh, but once writes have begun to the file, in the past, this has frozen the recordsize. So "self-tuning" or adjustments NEED to be decided probably at the creation of the FS object, OR some type of copy mechanism needs to be done to a new file with a different record size at a later time, when the default or past record size was determined to be significantly incorrect. Yes, I assume that many reads / writes will occur in the future that will amortize the copy cost.

So, yes group... I am still formulating the "best" algorithm for this. ZFS applies a lot of knowledge gained from UFS (page list stuff, chksum stuff, large file awareness/support), but adds a new twist to things..

Mitchell Erblich
------------------

Nicolas Williams wrote:
> Large files accessed sequentially don't need any special heuristic for
> record size determination: just use the filesystem's record size and be
> done. The bigger the record size, the better -- a form of read ahead.
Jeremy Teo wrote:
> Would it be worthwhile to implement heuristics to auto-tune
> 'recordsize', or would that not be worth the effort?

Here is one relatively straightforward way you could implement this.

You can't (currently) change the recordsize once there are multiple blocks in the file. This shouldn't be too bad because by the time they've written 128k, you should have enough info to make the choice. In fact, that might make a decent algorithm:

* Record the first write size (in the ZPL's znode)
* If subsequent writes differ from that size, reset the write size to zero
* When a write comes in past 128k, see if the write size is still nonzero; if so, then read in the 128k, decrease the blocksize to the write size, fill in the 128k again, and finally do the new write.

Obviously you will have to test this algorithm and make sure that it actually detects the recordsize on various databases. They may like to initialize their files with large writes, which would break this. If you have to change the recordsize once the file is big, you will have to rewrite everything[*], which would be time consuming.

--matt

[*] Or if you're willing to hack up the DMU and SPA, you'll "just" have to re-read everything to compute the new checksums and re-write all the indirect blocks.
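A rough sketch of the first-write-size detection described above, under the assumption that the hint lives in an in-memory per-file structure; the znode field and the shrink_blocksize_and_refill() helper are placeholders, not actual ZPL/DMU interfaces.

#include <stdint.h>

#define FS_MAX_RECORDSIZE   (128 * 1024)

typedef struct znode_hint {
    uint64_t first_write_size;  /* 0 = no consistent write size detected */
    uint64_t file_size;
} znode_hint_t;

/* Hypothetical helper: read the first 128k, shrink the block size, refill. */
extern void shrink_blocksize_and_refill(void *zp, uint64_t new_bs);

static void
zfs_write_hint(void *zp, znode_hint_t *h, uint64_t offset, uint64_t len)
{
    if (h->file_size == 0 && offset == 0) {
        /* Record the first write size. */
        h->first_write_size = len;
    } else if (offset + len <= FS_MAX_RECORDSIZE &&
        len != h->first_write_size) {
        /* Subsequent writes differ: reset the write size to zero. */
        h->first_write_size = 0;
    } else if (offset + len > FS_MAX_RECORDSIZE &&
        h->first_write_size != 0 &&
        h->first_write_size < FS_MAX_RECORDSIZE) {
        /*
         * First write past 128k and the detected write size survived:
         * read the existing 128k in, decrease the block size to the
         * write size, and fill the data back in before this write.
         */
        shrink_blocksize_and_refill(zp, h->first_write_size);
        h->first_write_size = 0;        /* decision made; stop tracking */
    }
    if (offset + len > h->file_size)
        h->file_size = offset + len;
}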
Matthew Ahrens wrote:
> Jeremy Teo wrote:
>> Would it be worthwhile to implement heuristics to auto-tune
>> 'recordsize', or would that not be worth the effort?
>
> It would be really great to automatically select the proper recordsize
> for each file! How do you suggest doing so?

Maybe I've been thinking with my systems hat on too tight, but why not have a hook into ZFS where an application, if written to the proper spec, can tell ZFS what its desired recordsize is? Then you don't have to play any guessing games.
Torrey McMahon wrote On 10/15/06 22:13,:
> Maybe I've been thinking with my systems hat on too tight, but why not
> have a hook into ZFS where an application, if written to the proper
> spec, can tell ZFS what its desired recordsize is? Then you don't have
> to play any guessing games.

That would work, but the idea behind ZFS is to get rid of special hooks and tunables and just make it work well and fast for the existing API.

Neil.
Neil Perrin wrote:
> Torrey McMahon wrote On 10/15/06 22:13,:
>> Maybe I've been thinking with my systems hat on too tight, but why not
>> have a hook into ZFS where an application, if written to the proper
>> spec, can tell ZFS what its desired recordsize is? Then you don't
>> have to play any guessing games.
>
> That would work, but the idea behind ZFS is to get rid of special hooks
> and tunables and just make it work well and fast for the existing API.

I agree in principle. However, based on some of the other emails in the thread, it seems that the work required to watch the I/O coming in and recalibrate is considerable. I can't see how a hook to tell ZFS what an app's preferred record size is would be anywhere near as complex, or "fraught with peril", but I'll gladly be re-edumacated. :)

Of course we'd probably want an ingress hook for the app and an egress hook to the storage device... but I'm getting ahead of myself.
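For concreteness, a sketch of what such an application-side hint could look like. The ioctl command below is invented and no such per-file interface exists in ZFS; today the only knob is the per-dataset recordsize property (zfs set recordsize=8k pool/fs).

/*
 * Purely hypothetical interface sketch for an application-supplied
 * record size hint.  The command number and semantics are invented;
 * this is a sketch of the idea, not an existing API.
 */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define ZFS_IOC_SET_RECSIZE_HINT    0x5a2a  /* invented command number */

int
create_with_recordsize_hint(const char *path, unsigned long recsize)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return (-1);
    }
    /*
     * The hint would have to arrive before the file grows past one
     * block, since the block size cannot currently be changed after
     * that point.
     */
    if (ioctl(fd, ZFS_IOC_SET_RECSIZE_HINT, &recsize) < 0)
        perror("ioctl (hypothetical hint)");
    return (fd);
}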
Matthew Ahrens writes:
> Obviously you will have to test this algorithm and make sure that it
> actually detects the recordsize on various databases. They may like to
> initialize their files with large writes, which would break this.

Oracle will typically create its files with 128K writes, not recordsize ones.

-r
Roch wrote:
> Oracle will typically create its files with 128K writes,
> not recordsize ones.

Blast from the past...
http://www.sun.com/blueprints/0400/ram-vxfs.pdf

-- richard
Group, et al,

I don't understand: if the problem is systemic, based on the number of continually dirty pages and the stress of cleaning those pages, then why.....

The problem is FS independent, because any number of different installed FSs can equally consume pages; thus, solving the problem on a per-FS basis seems to me a bandaid approach.

Why doesn't the OS determine that a dangerously high-watermark number of pages is continually being paged out (we have swapped and have a large percentage of available pages always dirty, based on recent past history) and thus:

* force the writes to a set of predetermined pages (limit the number of pages for I/O),

* schedule I/O on these pages immediately, rather than waiting until there is a need for these pages only to find them dirty (hopefully a percentage of these pages will be cleaned and be immediately available if needed in the near future).

Yes, the OS could redirect the I/O as direct, without using the page cache, but the assumption is that these procs are behaving as multiple readers and need the cached page data in the near future. Changing the behaviour to stop caching the pages, because they CAN totally consume the cache, removes the multiple readers' reason to cache the data in the first place. Thus:

* guarantee that heartbeats are always regular by preserving 5 to 20% of pages for exec / text,

* limit the number of interrupts being generated by the network so low-level SCSI interrupts can page and not be starved (something the white paper did not mention); yes, this will cause the loss of UDP-based data, but we need to generate some form of backpressure / explicit congestion event,

* if the files coming in from the network were TCP based, hopefully a segment would be dropped and act as backpressure to the originator of the data,

* if the files are being read from the FS, then a max I/O rate should be determined based on the number of pages that are clean and ready to accept FS data,

* etc.

Thus, tuning to determine whether the page cache should be used for writes or reads should allow one set of processes not to adversely affect the operation of other processes. And any OS should only slow down the dirty I/O pages for those specific processes, with other processes' work being unaware of the I/O issues.

Mitchell Erblich
---------------------

Richard Elling - PAE wrote:
> Roch wrote:
> > Oracle will typically create its files with 128K writes,
> > not recordsize ones.
>
> Blast from the past...
> http://www.sun.com/blueprints/0400/ram-vxfs.pdf
Heya Roch,

On 10/17/06, Roch <Roch.Bourbonnais at sun.com> wrote:
-snip-
> Oracle will typically create its files with 128K writes,
> not recordsize ones.

Darn, that makes things difficult doesn't it? :(

Come to think of it, maybe we're approaching things from the wrong perspective. Databases such as Oracle have their own cache *anyway*. The reason we set recordsize is to match the block size to the write/read access size, so we don't read/buffer too much. But if we improved our caching to not cache blocks being accessed by a database, wouldn't that solve the problem also? (I'm not precisely sure where, and how much, performance we win from setting an optimal recordsize.)

Thanks for listening folks! :)

--
Regards,
Jeremy
No, the reason to try to match recordsize to the write size is so that a small write does not turn into a large read + a large write. In configurations where the disk is kept busy, multiplying 8K of data transfer up to 256K hurts.

This is really orthogonal to the cache -- in fact, if we had a switch to disable caching, this problem would get worse instead of better (since we wouldn't amortize the initial large read over multiple small writes).
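To make the amplification concrete, a small stand-alone calculation of the 8K-into-128K case described above (plain arithmetic, not ZFS code):

/*
 * Worked example of read-modify-write amplification: an 8K write into
 * an uncached 128K record costs a 128K read plus a 128K write.
 */
#include <stdio.h>

int
main(void)
{
    unsigned long recordsize = 128 * 1024;  /* file block size */
    unsigned long writesize  = 8 * 1024;    /* application write */

    unsigned long io_bytes = recordsize /* pre-read */ +
        recordsize /* write */;
    printf("%lu bytes of disk I/O to store %lu bytes: %lux amplification\n",
        io_bytes, writesize, io_bytes / writesize);
    /* Prints: 262144 bytes of disk I/O to store 8192 bytes: 32x amplification */
    return (0);
}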
Heya Anton,

On 10/17/06, Anton B. Rang <Anton.Rang at sun.com> wrote:
> No, the reason to try to match recordsize to the write size is so that
> a small write does not turn into a large read + a large write. In
> configurations where the disk is kept busy, multiplying 8K of data
> transfer up to 256K hurts.

Ah. I knew I was missing something. What COW giveth, COW taketh away...

> This is really orthogonal to the cache -- in fact, if we had a switch
> to disable caching, this problem would get worse instead of better
> (since we wouldn't amortize the initial large read over multiple small
> writes).

Agreed. It looks to me like there are only 2 ways to solve this:

1) Set recordsize manually
2) Allow the blocksize of a file to be changed even if there are multiple blocks in the file.

--
Regards,
Jeremy
Jeremy Teo wrote:
> On 10/17/06, Anton B. Rang <Anton.Rang at sun.com> wrote:
>> No, the reason to try to match recordsize to the write size is so that
>> a small write does not turn into a large read + a large write. In
>> configurations where the disk is kept busy, multiplying 8K of data
>> transfer up to 256K hurts.

(Actually ZFS goes up to 128k, not 256k (yet!))

> Ah. I knew I was missing something. What COW giveth, COW taketh away...

Yes, although actually most non-COW filesystems have this same problem, because they don't write partial blocks either, even though technically they could. (And FYI, checksumming would "take away" the ability to write partial blocks too.)

> 1) Set recordsize manually
> 2) Allow the blocksize of a file to be changed even if there are
>    multiple blocks in the file.

Or, as has been suggested, add an API for apps to tell us the recordsize before they populate the file.

--matt
Matthew Ahrens wrote:
> Or, as has been suggested, add an API for apps to tell us the
> recordsize before they populate the file.

I'll drop an RFE in and point people at the number.
On October 17, 2006 2:02:19 AM -0700 Erblichs <erblichs at earthlink.net> wrote:
> I don't understand: if the problem is systemic, based on the number of
> continually dirty pages and the stress of cleaning those pages, then
> why.....
>
> The problem is FS independent, because any number of different
> installed FSs can equally consume pages; thus, solving the problem on
> a per-FS basis seems to me a bandaid approach.

I'm not very well versed in this stuff, but ISTM you can't guarantee on-disk consistency unless the problem is dealt with per-FS.

-frank
On Oct 17, 2006, at 12:43 PM, Matthew Ahrens wrote:
> (Actually ZFS goes up to 128k, not 256k (yet!))

256K = 128K read + 128K write.

> Yes, although actually most non-COW filesystems have this same
> problem, because they don't write partial blocks either, even
> though technically they could. (And FYI, checksumming would "take
> away" the ability to write partial blocks too.)

In direct I/O mode, though, which is commonly used for databases, writes only affect individual disk blocks, not whole file system blocks. (At least for UFS & QFS, but I presume VxFS is similar.)

In the case of QFS in paged mode, only dirty pages are written, not whole file system blocks ("disk allocation units", or "DAUs", in QFS terminology). It's common to use 2 MB or larger DAUs to reduce allocation overhead, improve contiguity, and reduce the need for indirect blocks. I'm not sure if this is the case for UFS with 8K blocks and 4K pages, but I imagine it is.

As you say, checksumming requires either that whole "checksum blocks" (not necessarily file system blocks!) be processed, or that the checksum function is reversible (in the sense that inverse and composition functions for it exist):

  checksum(ABC) = f(g(A), g(B), g(C))

and there exists g^-1(B) such that we can compute

  checksum(AB'C) = f(g(A), g(B'), g(C))

or

  checksum(AB'C) = h(checksum(ABC), range(A), range(B), range(C), g^-1(B), g(B'))

[The latter approach comes from a paper I can't track down right now; if anyone's familiar with it, I'd love to get the reference again.]

-- Anton
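As a toy illustration of the reversible composition sketched above, here is a checksum built as a 64-bit modular sum of per-block hashes: rewriting one block only needs that block's old and new contribution. This only demonstrates the algebra; it is far too weak for real integrity checking, and real filesystem checksums (e.g., fletcher or SHA-256 over a whole block) do not compose this way, which is why partial-block writes are off the table.

/*
 * Toy "composable" checksum: the file checksum is the 64-bit sum of a
 * per-block mix of (block number, block contents).  Because addition is
 * invertible, rewriting block i only requires subtracting its old
 * contribution and adding the new one -- the f/g/g^-1 structure above.
 */
#include <stdint.h>
#include <stddef.h>

static uint64_t
block_mix(uint64_t blkno, const unsigned char *data, size_t len)
{
    /* FNV-1a style hash of the block, seeded with its block number. */
    uint64_t h = 1469598103934665603ULL ^ blkno;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return (h);
}

/* checksum(file) = sum over blocks of g(block) -- the "f" above. */
uint64_t
file_checksum(const unsigned char *blocks, size_t blksz, uint64_t nblocks)
{
    uint64_t sum = 0;
    for (uint64_t i = 0; i < nblocks; i++)
        sum += block_mix(i, blocks + i * blksz, blksz);
    return (sum);
}

/* Update after rewriting a single block, without touching the others. */
uint64_t
checksum_update_block(uint64_t old_sum, uint64_t blkno,
    const unsigned char *old_data, const unsigned char *new_data, size_t blksz)
{
    return (old_sum - block_mix(blkno, old_data, blksz) +
        block_mix(blkno, new_data, blksz));
}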
Torrey McMahon wrote:
> Matthew Ahrens wrote:
>> Or, as has been suggested, add an API for apps to tell us the
>> recordsize before they populate the file.
>
> I'll drop an RFE in and point people at the number.

For those playing at home, the RFE is 6483154.
Hello all,

Isn't a large block size a simple case of prefetching? In other words, if we possessed an intelligent prefetch implementation, would there still be a need for large block sizes? (Thinking aloud)

:)

--
Regards,
Jeremy
Reads? Maybe. Writes are another matter, namely the overhead associated with turning a large write into a lot of small writes. (Checksums, for example.)

Jeremy Teo wrote:
> Isn't a large block size a simple case of prefetching? In other words,
> if we possessed an intelligent prefetch implementation, would there
> still be a need for large block sizes? (Thinking aloud)
Torrey McMahon writes:
> Reads? Maybe. Writes are another matter, namely the overhead associated
> with turning a large write into a lot of small writes. (Checksums, for
> example.)
>
> Jeremy Teo wrote:
> > Isn't a large block size a simple case of prefetching? In other
> > words, if we possessed an intelligent prefetch implementation, would
> > there still be a need for large block sizes? (Thinking aloud)

What Torrey says, plus: a file stored as multiple small records will still need multiple head seeks to fetch its data (prefetch or not). Given that head seeks are a precious resource, large records are, at times, a goodness. Larger records also reduce the amount of metadata.

-r
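A quick back-of-the-envelope on the metadata point, assuming roughly 128 bytes per block pointer (the size of a ZFS blkptr_t) and ignoring the indirect-block tree levels, so the numbers are only indicative:

/*
 * Back-of-the-envelope comparison of block-pointer metadata for a
 * 100GB file at 8K vs 128K records.
 */
#include <stdio.h>

int
main(void)
{
    const unsigned long long filesize = 100ULL << 30;  /* 100 GB */
    const unsigned long long bp_size  = 128;           /* bytes per blkptr */
    const unsigned long long recsizes[] = { 8 << 10, 128 << 10 };

    for (int i = 0; i < 2; i++) {
        unsigned long long nblocks = filesize / recsizes[i];
        printf("recordsize %3lluK: %10llu blocks, ~%llu MB of block pointers\n",
            recsizes[i] >> 10, nblocks, (nblocks * bp_size) >> 20);
    }
    /*
     * recordsize   8K:   13107200 blocks, ~1600 MB of block pointers
     * recordsize 128K:     819200 blocks, ~100 MB of block pointers
     */
    return (0);
}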