Jeff Dean
2007-Feb-09 21:22 UTC
[Betternestedset-talk] Improving the nested_set model to remove redundancy
I''ve been working with the Celko sets a bit lately and looking at the rails nested_set and better_nested_set and I think there are a few redundancies/areas of improvement for performance in both models. I may not understand the issues well, so please excuse me if these points are naive: First, I think the parent_id field is completely unnecessary. The point of the nested set model seems to be that you eliminate the need for the join, thereby creating a table with a single fact, in a single place. If you don''t have a parent_id, you identify a root as anything with lft=1. You can achieve multiple roots by using composite keys (by adding a set_id column, for example). Second, the numbering scheme need not be contiguous. If you use the full range of floats given in a database you can insert thousands of children without ever having to recalculate other rows in the database. This will drastically increase performance on inserts/updates, and unless you are working with huge sets, you may never need to recalculate parent boundaries. For more information (if you haven''t already seen it), check out: http://searchoracle.techtarget.com/tip/1,289483,sid13_gci537290,00.html Way down the page, Celko talks about "1) The Algebra Property: For each node (n), (rgt-lft+1)/2 = size of subtree rooted at (n). This requires that the union of the rgt and lft numbers be an unbroken sequence, no gaps. 2) The Between-ness Property that you can find superiors and subordinates with a BETWEEN predicate" Further reading on how to perform much more complex calculations, check out: http://www.dbazine.com/oracle/or-articles/tropashko4 Is there any reason why we would need to keep parent_id or unbroken sequences? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/betternestedset-talk/attachments/20070209/fd9a012e/attachment.html
Krishna Dole
2007-Feb-10 00:15 UTC
[Betternestedset-talk] Improving the nested_set model to remove redundancy
Hi Jeff, Thanks for writing us about this, and welcome to the list.> First, I think the parent_id field is completely unnecessary. The point of > the nested set model seems to be that you eliminate the need for the join, > thereby creating a table with a single fact, in a single place. If you > don''t have a parent_id, you identify a root as anything with lft=1. You can > achieve multiple roots by using composite keys (by adding a set_id column, > for example).You''re right-- strictly speaking the parent_id field is unnecessary, and thus non-normalized. However, we have chosen to keep it thus far for several reasons: 1) It provides a good backup to the lft/rgt indexes. Nested sets are inherently both volatile and brittle. I feel good about our code''s ability to maintain the left/right values, but if they somehow became corrupted, repairing them without having parent_id would be a devilish manual process at best and impossible at worst. We provide handy methods for both checking and re-indexing using the parent_id. 2) It provides significant performance advantages in a number of situations: the common task of fetching a node''s parent is _much_ faster using parent_id. Fetching immediate children would require a ''level'' column if the parent_id was not present, and maintaining a level column would be a bit of work. (But nice to have-- it''s on the to-do list) 3) It makes reversion to an adjacency-list model trivial (just nuke the lft and rgt cols) if you decide nested sets aren''t for you. One of my goals with the code is to make it easy for people using acts_as_tree to try a nested set model. 4) The parent_id is required (and used to good effect) in the ez-set branch. 5) It doesn''t really cause any problems, performance or otherwise.> > Second, the numbering scheme need not be contiguous. If you use the full > range of floats given in a database you can insert thousands of children > without ever having to recalculate other rows in the database. This will > drastically increase performance on inserts/updates, and unless you are > working with huge sets, you may never need to recalculate parent boundaries.I looked into using floats, but I was turned off by the fact that trees less than 60 levels deep (if I recall correctly) could overrun the index. That said, if you would like to try implementing something like this, please do-- I''m happy to provide what advice I can. Keeping the index sequential has a few benefits: you can find out how many descendants a node has without hitting the database, and it makes finding ''leaf'' nodes easy.> > For more information (if you haven''t already seen it), check out: > > http://searchoracle.techtarget.com/tip/1,289483,sid13_gci537290,00.htmlThanks for linking to that. I''d seen most of it before (Joe has recycled that material a lot of times), but some of the information at the bottom was interesting. Cheers, Krishna
Jeff Dean
2007-Feb-10 00:49 UTC
[Betternestedset-talk] Improving the nested_set model to remove redundancy
All good points - thanks for the clarification! On 2/9/07, Krishna Dole <dontfall at gmail.com> wrote:> > Hi Jeff, > > Thanks for writing us about this, and welcome to the list. > > > First, I think the parent_id field is completely unnecessary. The point > of > > the nested set model seems to be that you eliminate the need for the > join, > > thereby creating a table with a single fact, in a single place. If you > > don''t have a parent_id, you identify a root as anything with lft=1. You > can > > achieve multiple roots by using composite keys (by adding a set_id > column, > > for example). > > You''re right-- strictly speaking the parent_id field is unnecessary, > and thus non-normalized. However, we have chosen to keep it thus far > for several reasons: > > 1) It provides a good backup to the lft/rgt indexes. Nested sets are > inherently both volatile and brittle. I feel good about our code''s > ability to maintain the left/right values, but if they somehow became > corrupted, repairing them without having parent_id would be a devilish > manual process at best and impossible at worst. We provide handy > methods for both checking and re-indexing using the parent_id. > > 2) It provides significant performance advantages in a number of > situations: the common task of fetching a node''s parent is _much_ > faster using parent_id. Fetching immediate children would require a > ''level'' column if the parent_id was not present, and maintaining a > level column would be a bit of work. (But nice to have-- it''s on the > to-do list) > > 3) It makes reversion to an adjacency-list model trivial (just nuke > the lft and rgt cols) if you decide nested sets aren''t for you. One of > my goals with the code is to make it easy for people using > acts_as_tree to try a nested set model. > > 4) The parent_id is required (and used to good effect) in the ez-set > branch. > > 5) It doesn''t really cause any problems, performance or otherwise. > > > > > Second, the numbering scheme need not be contiguous. If you use the > full > > range of floats given in a database you can insert thousands of children > > without ever having to recalculate other rows in the database. This > will > > drastically increase performance on inserts/updates, and unless you are > > working with huge sets, you may never need to recalculate parent > boundaries. > > I looked into using floats, but I was turned off by the fact that > trees less than 60 levels deep (if I recall correctly) could overrun > the index. That said, if you would like to try implementing something > like this, please do-- I''m happy to provide what advice I can. > > Keeping the index sequential has a few benefits: you can find out how > many descendants a node has without hitting the database, and it makes > finding ''leaf'' nodes easy. > > > > > For more information (if you haven''t already seen it), check out: > > > > http://searchoracle.techtarget.com/tip/1,289483,sid13_gci537290,00.html > > Thanks for linking to that. I''d seen most of it before (Joe has > recycled that material a lot of times), but some of the information at > the bottom was interesting. > > Cheers, > Krishna > _______________________________________________ > Betternestedset-talk mailing list > Betternestedset-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/betternestedset-talk >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/betternestedset-talk/attachments/20070209/feac712d/attachment-0001.html