Sascha Ebach
2004-Dec-04 16:02 UTC
To RForum devs: Writing a reusable NestedSet implementation
Hi, I am adressing the guys behind RForum, but everybody else is also welcomed to join in. Everytime I develop a fairly complex application there is always the question of how to store the content. Usually there are two options. The first one being the easiest and also the one with the most repetition (DRY) is just to store everything in its own table and write a class for it. This one doesn''t scale very well. With scaling in this case I mean everytime you add another content type you have to add another table for it. Being a developer this is not a big problem besides the repetition. But what if you want to give the user the possibility to create her own content types via a web interface? (For example a form with self defined fields or a new media type or what if you suddenly decide to put in products that are for sale or comments to an article ...) The second very common option is to put everything in one table and use a tree structure to represent the data. This is not only useful if you are developing a forum (like RForum), but can be for other application types as well. For example in a CMS (which I am developing using Rails right now) you can store all kinds of content into a single table consisting of different content types like folders, articles, media, forms, products and what not. And this is what I have been doing. Working with tree structures in a database server usually involves some mixture of the adjacency list model (using a parent_id field) with the modified preorder tree traversal algo (using a left and right field) or "nested set". Everytime I reimplement such a nested set it mind boggles me and I wish I wouldn''t have to do it ;) I am certain that other developers feel the same (ohhh, not a nested set again ...) So this time I was glad that I could already find a rails implementation of this in RForum. I gladly stole half of the code and put it in my source. But the RForum impl. is not complete, yet. It doesn''t allow for different things. The most important to me at the moment is moving/copying trees/nodes. I know it must be on your todo list because one of your user stories says "Admin splits a topic into two topics" on http://rforum.andreas-s.net/trac/wiki/UserStories. Wouldn''t it be great if we had a "module NestedSet" the would not be dependent on RForum, my CMS (or any other package that might use it) or even Rails/AR. This strikes me as something very useful for development of web apps with complexer content repositories. I could imagine myself the following features. - CRUD of nodes and trees of course - defining your own table layout (maybe you don''t want to call your left and right fields ''l'' and ''r'', but ''lft'' and ''rgt''), this would make it possible to hook it into legacy systems - fetching the children (subtrees) of a node (all or upto x levels) - fetching only the immediate children of a node (by using parent_id) - copy/move nodes an whole trees - rebuild the the nested set from the parent_id field (already have a method for this) - fetching the path to a node (for those breadcrumb trails) - displaying the tree by all means of outputs via a separat adapter (class NestedSetOutput), for ex. nested HTML lists (<ul><li><ul><li></li></ul></ul>), graphical trees by GraphViz, forum posts of course, DHTML menus (if one must) and whatever you can think of. - being able to use the module without Rails, i.e. AR (like any sane person would ever want to ;) - everything well tested, so the developer can concentrate on testing other parts of the system. Does that sound interesting to you, too? And this would work by simply doing class Post < ActiveRecord::Base inlcude NestedSet::Base include NestedSet::Output::ForumPost # overwrite the standard field names def nested_set_fields {''id'' => ''id'', # key, by which a node is identified ''left'' => ''lft'', # left key for nested set ''right'' => ''rgt'', # right key for nested set ''order'' => ''priority'', # for sorting siblings ''level'' => ''level'', # level or depth of node ''parent'' => ''parent_id'' # for adjacency list stuff } end # we are good to go. you just saved yourself days of developing # and testing your own tree structure end I was actually going to just start this and see how far I can come until after xmas. I thought it might be a good idea to just throw this idea onto this list and listen to what (RForum) ppl have to say about that. Also, I would like to recommend looking at the source code of NestedSet.php (php pear package DB_NestedSet) which already does exactly this. In my opinion this particular php class is reasonably well designed (it is a little bloaty) and could be used as a reference for features that could be implemented. It is even unit tested which is rare for a php class. http://cvs.php.net/co.php/pear/DB_NestedSet/NestedSet.php?r=1.86 Another thing is that I don''t have any public repository like (http://rforum.andreas-s.net/trac) and it would be cool if we could just put it in there. If that is not possible I would have to think of something else. What do you think? I am going to start to extract the nested set functionality that I have so far into a module and when I have something concrete I am going to post to this list again. Maybe we can set something up? (I am off now until tommorow evening, so don''t wait for any immediate responses.) -- Sascha Ebach
Rick Bradley
2004-Dec-05 22:03 UTC
Re: To RForum devs: Writing a reusable NestedSet implementation
* Sascha Ebach (se-eFwX6J65rk9VioaHkBSlcw02NpfuEekPhC4ANOJQIlc@public.gmane.org) [041204 11:03]:> Wouldn''t it be great if we had a "module NestedSet" the would not be > dependent on RForum, my CMS (or any other package that might use it) or > even Rails/AR. This strikes me as something very useful for development > of web apps with complexer content repositories. > > I could imagine myself the following features. > > - CRUD of nodes and trees of course > - defining your own table layout (maybe you don''t want to call your left > and right fields ''l'' and ''r'', but ''lft'' and ''rgt''), this would make it > possible to hook it into legacy systems > - fetching the children (subtrees) of a node (all or upto x levels) > - fetching only the immediate children of a node (by using parent_id) > - copy/move nodes an whole trees > - rebuild the the nested set from the parent_id field (already have a > method for this) > - fetching the path to a node (for those breadcrumb trails) > - displaying the tree by all means of outputs via a separat adapter > (class NestedSetOutput), for ex. nested HTML lists > (<ul><li><ul><li></li></ul></ul>), graphical trees by GraphViz, forum > posts of course, DHTML menus (if one must) and whatever you can think of. > - being able to use the module without Rails, i.e. AR (like any sane > person would ever want to ;) > - everything well tested, so the developer can concentrate on testing > other parts of the system. > > Does that sound interesting to you, too? And this would work by simply doing > > class Post < ActiveRecord::Base > inlcude NestedSet::Base > include NestedSet::Output::ForumPost > > # overwrite the standard field names > def nested_set_fields > {''id'' => ''id'', # key, by which a node is identified > ''left'' => ''lft'', # left key for nested set > ''right'' => ''rgt'', # right key for nested set > ''order'' => ''priority'', # for sorting siblings > ''level'' => ''level'', # level or depth of node > ''parent'' => ''parent_id'' # for adjacency list stuff > } > end > > # we are good to go. you just saved yourself days of developing > # and testing your own tree structure > > endAt first I misread (I think) this to imply that these were binary trees, but I take it that the ''left'' and ''right'' fields simply point to the leftmost and rightmost nodes in the child list of this set/tree node(?). In a previous lifetime I wrote proprietary CMS software, unfortunately in PHP, and (un?)fortunately anything worth looking at is all open-sourced by now. The product that had the most success was a CMS [0] my business partner and I started selling back in 1998 that was based on a graph model, which looks to be a generalization of the nested set model. If nothing else, it''s a testament to the fact that there''s someone out there who will pay good money for just about anything. I did like the way the graph content model worked though. If I wanted to re-implement the basic Model in Rails today (which I''ve considered, but my calendar hasn''t been inclined to entertain) I''d use a few model classes for the content: Graph (id, default root node, some metadata) Node (id, default graph, various data/metadata fields) Edge (id, graph, from node, to node, order) Nodes contain the content. Graphs are collections of Nodes and Edges. Graphs are really multigraphs (i.e., you can have multiple edges from A->B in a single graph). A node can appear in any number of graphs. Edges are ordered, for view rendering purposes. In the CMS I also had node names. A node name is just an alphanumeric unique key (so users can refer to meaningful names instead of always using numbers) and is just a convenience. To render a view you are given either a graph id, a node_id/name, or a graph id and node id/name. Then the system basically figures out which graph this is going to render, and from whence the rendering will start (a node), using the various defaults available (default graph for a node or default root node for a graph) if necessary. The edges for the graph are pulled in from the database and an in-memory tree is made, starting with a graph traversal from the start node: basically, do not follow an edge to a node which has already been seen on the current root-path, which happens to be sufficient to avoid loops, hence we have a tree suitable for rendering as (say) XHTML. Then the nodes actually used in that in-memory traversal are pulled in from storage and the actual (tree) view is displayed. For example (leaving out edge ordering information), a graph and 3 views (based on different start nodes): A -> B (start @ A) (start @ D) (start @ C) | ^ +A +D +C v | +-B +-E +-D C -> D => +-C +-B +-E | +-D +-B v +-E E Since the edges for a graph can be pulled in in one operation (a single query if SQL is the backend), the in-memory traversal is at most linear in the number of edges (not all nodes will necessarily be encountered), and the rendered nodes can then be pulled in in one operation, this tends to be very efficient. The reason this particular breakdown (mainly separating things into "graphs") was used was to avoid some of the costs associated with transitive closure/reduction [1] [2] as stores of content objects get larger. The problem is always going to be there, though, it''s just a question of where you draw the boundaries. Once you allow inclusion of a tree of outside comment threads in the current comment threads, or in our case introduce graph-to-graph links (for example to reuse graphs, even in summary), you start fighting with transitivity again. By keeping the level of aggregation near the size of the viewable chunk the effects of transitivity can be kept under control. Most of the common operations on trees are trivial to support with this model, including zooming (view this node and all "children", or follow "parent" "trees" up a level), reordering, reparenting, reuse, etc. As well as some normally difficult CMS operations such as tracking reuse of content (where else does this item appear?), true objectification (change this content here and all "copies" change), "copy and paste" semantics, as well as alternate views with arbitrary multigraph semantics. This is just an alternate formulation to nested sets. The one thing we learned from building systems in this way (really, before building them this way) was that transitive closure/reduction really can kill performance. In the SQL world if you have to issue a database query for every layer in a tree (for us every layer in a spanning tree) you will eventually pay a high price. If you can aggregate connected components together so that they can be easily retrieved (/stored/updated/etc.) then you hopefully won''t have to devote as much time to later optimizing around a slow storage implementation. [0] http://www.rickbradley.com/code/ce/ (please, forgive the egregious PHP, as well as the complete lack of good MVC separation) [1] http://www2.toki.or.id/book/AlgDesignManual/BOOK/BOOK4/NODE163.HTM (shame the original book''s website appears to be having DNS problems) [2] http://www.cs.hut.fi/~enu/tc.html Rick -- http://www.rickbradley.com MUPRN: 307 | would have won that game random email haiku | (as they were eating Auburn | up there at the stripe).
Rick Bradley
2004-Dec-05 22:08 UTC
Re: To RForum devs: Writing a reusable NestedSet implementation
* Rick Bradley (rick-xSCPAUIMY+WN9aS15agKxg@public.gmane.org) [041205 17:06]:> At first I misread (I think) this to imply that these were binary trees, > but I take it that the ''left'' and ''right'' fields simply point to the > leftmost and rightmost nodes in the child list of this set/tree node(?).Probably bad form to reply to myself, but I see better now what you''re talking about (I''m digging through links I''ve collected over the last week and one was the "Storing Hierarchical Data in a Database" article you''ve linked). I''m wondering now whether there''s a straightforward generalization of the traversal-id method to the general graph formulation I related in the prior mail. Best, Rick -- http://www.rickbradley.com MUPRN: 562 | issues, not only random email haiku | surrounding this but other | decisions being made.
Sascha Ebach
2004-Dec-06 04:33 UTC
Re: To RForum devs: Writing a reusable NestedSet implementation
Hi Rick, thanks for your reply. Now I see why I should have paid more attention to graph theory ;) Surely, any form of tree traversal is bound to have performance problems at one point in time if the tree really starts to fill up. But I guess this will be the case for any content repository technique. If you want to read more about the implementation I meant, here are my sources: http://threebit.net/tutorials/nestedset/tutorial1.html http://cvs.php.net/co.php/pear/DB_NestedSet/NestedSet.php?r=1.86 http://www.sitepoint.com/print/hierarchical-data-database http://www.evolt.org/article/Four_ways_to_work_with_hierarchical_data/17/4047/index.html http://www.intelligententerprise.com/001020/celko.jhtml?_requestid=745072 http://www.intelligententerprise.com/001020/celko1_2.jhtml Celko is surely the most influential source (in the database field). I don''t have his book yet (SQL for Smarties), but I will probably order it soon. He gives the example which is usually referred to by everybody else. What I want to do is quite simply duplicate and rubyfy the code of the DB_NestedSet php implementation. I think that a lot of "content management" projects (for example RForum) could benefit from this if it was really easy to use such a library. The technique you mention could be implemented later. (Maybe once I actually understand it ;) For the beginning it would be nice to have an interface for reading and writing nodes. The developer shouldn''t really need to think about Trees and Nodes. class Article < Contentnode def save: end def move; end def copy; end ... # do all the node and tree stuff in these methods automatically, # if possible end And then later just change the configuration class NestedSet def use_implementation; ''graph'' end end It would be great if something like this were possible. But for the beginning I would settle for less complex and more common solution I mentioned above. Also very interesting is the Java Content Repository. It is a specification which tries to standardize the way content is accessed in an cms like application. Which is a good thing I guess since there is no standardization at all in this field. Get the jsr170-0.12.pdf from the download here http://www.jcp.org/aboutJava/communityprocess/review/jsr170/ The JCR specification talks more about the high level interface to access content rather than the implementation of the low level access techniques (where my interests lie right now). This API could be useful though since I think standardization is a good thing. -- Sascha Ebach
Alex
2004-Dec-29 16:01 UTC
Re: To RForum devs: Writing a reusable NestedSet implementation
Sascha Ebach wrote:> I am adressing the guys behind RForum, but everybody else is also > welcomed to join in.Sorry for not replying until now - for the last month I was behind one of the most paranoid corporate firewalls that I have ever seen. Have a scary task of going through 6000 emails now :) [long feature list of a nested set tree implementation skipped]> Does that sound interesting to you, too?Sorry again, but not interesting enough to actually do it at this very moment. It would be a big distraction from getting RForum 1.0 out of the door. If you decide to do it by yourself, you are most welcome to rip whatever you like from RForum for it.> Another thing is that I don''t have any public repository like > (http://rforum.andreas-s.net/trac) and it would be cool if we could > just put it in there.That''s what RubyForge is for :) The only reason RForum is on Andreas''s box is because we wanted to work on Subversion, and the Forge, sadly, doesn''t have that service yet. Otherwise it is almost perfect. Best regards, Alex
Sascha Ebach
2004-Dec-29 17:27 UTC
Re: To RForum devs: Writing a reusable NestedSet implementation
Alex: Hi, yeah, I was wondering what took you so long.> Sascha Ebach wrote:> Sorry again, but not interesting enough to actually do it at this very > moment. It would be a big distraction from getting RForum 1.0 out of the > door. > > If you decide to do it by yourself, you are most welcome to rip whatever > you like from RForum for it.I am already working on it. On and off. I hope it''ll be useful for Rforum when it is ready. Mainly I was hoping for (your) feedback on what I am doing. For that I would have to release something first, right ;) When the time comes ... -- Sascha Ebach