James Cone
2007-Nov-22 00:48 UTC
[zfs-discuss] User-visible non-blocking / atomic ops in ZFS
Hello All,

Is any of the following available in ZFS, or is there any plan to add it?

- persistent atomic-inc/atomic-dec of a group of bytes in a file?

- LL/SC or Compare-and-swap of a group of bytes in a file, or a whole file

- multiple renames, where:
  - all or none of them happen, with regard to:
    - readers of the files
    - panic or hard-stop of the machine
    - other people doing renames or multiple renames
?

Regards,
James.
Neil Perrin
2007-Nov-22 01:10 UTC
None of the below are available or planned in ZFS. In fact, I'm not aware of those services in any of Sun's filesystems. What's the interface for them? Is there a standard or proposed standard? Also what's the purpose? Maybe the same can be achieved in other ways.

Neil.

James Cone wrote:
> Hello All,
>
> Is any of the following available in ZFS, or is there any plan to add it?
>
> - persistent atomic-inc/atomic-dec of a group of bytes in a file?
>
> - LL/SC or Compare-and-swap of a group of bytes in a file, or a whole
> file
>
> - multiple renames, where:
>   - all or none of them happen, with regard to:
>     - readers of the files
>     - panic or hard-stop of the machine
>     - other people doing renames or multiple renames
> ?
>
> Regards,
> James.
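As one illustration of "achieved in other ways", the compare-and-swap and atomic-increment requests can be approximated in user space on any POSIX file system. What follows is only a sketch under stated assumptions: the function names `file_compare_and_swap` and `file_atomic_increment` are hypothetical, the exclusion relies on advisory byte-range locks (so it only protects against other processes that take the same lock), and nothing here makes the update atomic with respect to a panic or hard stop of the machine.

```python
import fcntl
import os


def file_compare_and_swap(path, offset, expected, new):
    """Overwrite the bytes at `offset` with `new` only if they equal `expected`.

    Hypothetical helper: returns True if the swap happened, False otherwise.
    Only cooperating processes (those that take the same advisory lock) are
    excluded, and there is no crash atomicity.
    """
    if len(expected) != len(new):
        raise ValueError("expected and new must have the same length")
    fd = os.open(path, os.O_RDWR)
    try:
        # Exclusive advisory lock covering just the byte range of interest.
        fcntl.lockf(fd, fcntl.LOCK_EX, len(expected), offset, os.SEEK_SET)
        current = os.pread(fd, len(expected), offset)
        if current != expected:
            return False
        os.pwrite(fd, new, offset)
        os.fsync(fd)  # push the change to stable storage before unlocking
        return True
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def file_atomic_increment(path, offset, width=4):
    """Increment a big-endian unsigned counter stored in the file.

    Built as a retry loop around the compare-and-swap above, in the same
    style as a hardware CAS loop. Returns the new counter value.
    """
    while True:
        fd = os.open(path, os.O_RDONLY)
        try:
            raw = os.pread(fd, width, offset)
        finally:
            os.close(fd)
        value = (int.from_bytes(raw, "big") + 1) % (1 << (8 * width))
        if file_compare_and_swap(path, offset, raw, value.to_bytes(width, "big")):
            return value
```

The retry loop in `file_atomic_increment` is non-blocking only in the loose sense that a failed swap is retried rather than waited on; the lock inside the swap itself can still block briefly.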
can you guess?
2007-Nov-22 03:09 UTC
I'm going to combine three posts here because they all involve jcone:

First, as to my message heading:

The 'search forum' mechanism can't find his posts under the 'jcone' name (I was curious, because they're interesting/strange, depending on how one looks at them). I've also noticed (once in his case, once in Louwtjie's) that the 'last post' column of one thread may reflect a post made to a different thread.

Second, in response to your "Indexing other than hash tables" post:

The only way you could get a file system like ZFS to perform indexed look-ups for you would be to make each of your 'records' an entire file with the appropriate look-up name, and ReiserFS may be the only current file system that could handle this reasonably well.

This is an outgrowth of the Unix mindset that files must only be byte-streams rather than anything more powerful (such as the single- and multi-key indexed files of traditional minicomputer and mainframe systems) - and that's especially unfortunate in ZFS's case, because system-managed COW mechanisms just happen to be a dynamite way to handle b-trees (you could do so at the application level on top of ZFS via use of a sparse file plus a facility to deallocate space in it explicitly, but you'd still need an entire separate level of in-file space-allocation/deallocation mechanism). B-trees are the obvious solution to the kind of partial-key and/or key-range queries that you described.

Finally, in response to your current post (which sounds more as if it had come from a hardware engineer than from a database type):

All the facilities that you describe are traditionally handled by transactions of one form or another, and only read-only transactions can normally be non-blocking (because they simply capture a consistent point-in-time database state and operate upon that, ignoring any subsequent changes that may occur during their lifetimes).
Other less-popular but more general non-blocking approaches exist which simply abort upon detecting conflict rather than attempt to wait for the conflict to evaporate. This tends not to scale very well because (unlike the case with non-blocking low-level hardware synchronization) restarting a transaction when you don't have to can very often result in a *lot* of redundant work being performed. These approaches include some multi-version schemes that implement more general 'time domain addressing' than that just described for read-only transactions, and the rare implementations based upon 'optimistic' concurrency control that let conflicts occur and then decide whether to abort someone when they attempt to commit.

ZFS supports transactions only for its internal use, and cannot feasibly support arbitrarily complex transactions because its atomicity approach depends upon gathering all transaction updates in RAM before writing them back atomically to disk (yes, it could perhaps do so in stages, since the entire new tree structure doesn't become visible until its root has been made persistent, but that could arbitrarily delay other write activity in the system). While I think that supporting user-level transactions is a useful file-system feature and a few file systems such as Transarc's Structured File System have actually done so, ZFS would have to change significantly to do so for anything other than *very* limited user-level transactions - hence I wouldn't hold my breath waiting for such support in ZFS.

- bill

This message posted from opensolaris.org
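To make the 'optimistic' approach described above concrete, here is a minimal in-memory sketch, not a description of how ZFS or any particular database does it. The class names `Store`, `Transaction` and `ConflictError` are hypothetical. Each key carries a version number; a transaction remembers the versions it read, buffers its writes, and is validated only at commit time, aborting if anything it read has since changed.

```python
import threading


class ConflictError(Exception):
    """Raised at commit time when something the transaction read has changed."""


class Store:
    """Hypothetical in-memory key/value store with a version per key."""

    def __init__(self):
        self._data = {}  # key -> (version, value)
        self._lock = threading.Lock()  # held only during validate-and-apply

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self._store = store
        self._reads = {}   # key -> version observed on first read
        self._writes = {}  # key -> buffered new value

    def read(self, key, default=None):
        if key in self._writes:  # a transaction sees its own writes
            return self._writes[key]
        version, value = self._store._data.get(key, (0, default))
        self._reads.setdefault(key, version)
        return value

    def write(self, key, value):
        self._writes[key] = value  # nothing is visible to others until commit

    def commit(self):
        with self._store._lock:
            # Validation: abort if any key we read has a newer version now.
            for key, seen in self._reads.items():
                current, _ = self._store._data.get(key, (0, None))
                if current != seen:
                    raise ConflictError(key)
            # Apply: install buffered writes and bump their versions.
            for key, value in self._writes.items():
                version, _ = self._store._data.get(key, (0, None))
                self._store._data[key] = (version + 1, value)
```

If two transactions both read the same counter and both try to write it back incremented, the first commit succeeds and the second raises `ConflictError`, so all the work done in the second transaction has to be redone. That wasted work is the scaling problem mentioned above.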
James Cone
2007-Nov-22 03:39 UTC
Hi Bill,

Yes, that covers all of my selfish questions, thanks.

The B-trees I'm used to tree divide in arbitrary places across the whole key, so doing partial-key queries is painful.

I can't find "Structured File System" "Transarc" usefully in Google. Do you have a link handy? If not, never mind.

Regards,
James.
can you guess?
2007-Nov-22 07:43 UTC
> The B-trees I'm used to tree divide in arbitrary places across the whole
> key, so doing partial-key queries is painful.

While the b-trees in DEC's Record Management Services (RMS) allowed multi-segment keys, they treated the entire key as a byte-string as far as prefix searches went (i.e., the segmentation wasn't significant to that, and there's no obvious reason why it should have been in other implementations).

> I can't find "Structured File System" "Transarc" usefully in Google. Do
> you have a link handy? If not, never mind.

Well, transarc.com now leads to a porn site, so that's not much help. And Wikipedia's entry for Transarc is regrettably sparse.

Transarc was a Pittsburgh R&D company formed by some *very* bright CMU people. It's probably best known for its 'Encina' distributed transaction environment (SFS was actually part of Encina, but IIRC a separable one), for having developed the distributed file system (DFS) component of the Open Group's Distributed Computing Environment (DCE), and for AFS, the productized (and now open source) version of CMU's distributed Andrew file system; my own acquaintance with Transarc became closer when I was helping develop a distributed transactional object system in the mid-'90s and we were using their book "Camelot and Avalon" for high-level design inspiration. They were always closely associated with IBM, which absorbed them as a wholly-owned subsidiary in 1994 (and I've heard relatively little about them since).

SFS was one of their lesser-known achievements: a record-oriented transactional file system.
I've always felt that system-managed record-oriented files were useful, in part because a lot of the nitty-gritty space management that's required (e.g., to handle the structured pages that tend to be necessary to accommodate data that's allowed to change its size or is required to remain in some key order under insertion/update/deletion activity) duplicates similar space-management required of the system to manage conventional byte-stream files and in part because any kind of system-wide lock- and deadlock-management facilities tend to want to tie into such data at a higher-than-byte-stream level (e.g., because the locked entities may have to move around) - so SFS was interesting to me. Unfortunately, it's been long enough that I can't remember too many details about it - e.g., it may or may not have supported interlocked access at the record field level - and at least after a quick search I can't find any papers about it that I may have downloaded (that era was before I really recognized how evanescent Web material often may be).

I actually did get a Google hit at position 19 with the search terms you used (after a plethora of hits on "log structured file system", of course), but it wasn't very enlightening. Nor were several later ones, until hit 42 at the University of Waterloo - a .pdf that contains at least a brief description starting on page 21 (including a thinly-disguised rip-off of a figure in Gray & Reuter's classic "Transaction Processing" - but it's not quite *identical*...).

Aha - good old reliable IBM *does* still have some SFS documentation on line that hit 75 noticed; munging that URL a bit led to

http://publib.boulder.ibm.com/infocenter/txformp/v5r1/index.jsp?noscript=1

(expand "Encina Books" in the left-hand frame and start digging...).

- bill
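The point about treating a multi-segment key as one byte-string for prefix searches can be sketched as follows. This is an illustration only, not RMS's or SFS's actual interface: a sorted list with binary search stands in for the b-tree (the key ordering is what matters here), and the names `SortedIndex` and `prefix_upper_bound` are hypothetical. A prefix search is just a range query from the prefix up to the smallest byte-string that sorts after every key beginning with that prefix.

```python
import bisect


def prefix_upper_bound(prefix):
    """Smallest byte-string greater than every string that starts with prefix.

    Returns None when no such bound exists (empty prefix or all 0xFF bytes),
    meaning the range is unbounded above.
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return None
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


class SortedIndex:
    """Hypothetical ordered index; a sorted list stands in for a b-tree."""

    def __init__(self):
        self._keys = []
        self._values = []

    def insert(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value  # replace an existing record
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def range(self, low, high=None):
        """Records with low <= key < high, in key order (high=None: no limit)."""
        start = bisect.bisect_left(self._keys, low)
        if high is None:
            end = len(self._keys)
        else:
            end = bisect.bisect_left(self._keys, high)
        return list(zip(self._keys[start:end], self._values[start:end]))

    def prefix(self, prefix):
        """Records whose key starts with prefix; segment boundaries are ignored."""
        return self.range(prefix, prefix_upper_bound(prefix))
```

With keys built by concatenating segments, for example a date segment followed by a user segment such as `b"2007-11-22/jcone"`, a query like `index.prefix(b"2007-11")` returns every record for that month, even though the prefix ends in the middle of the first segment.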