The case has been made on Phoronix for fractal trees: they make use of hard
drive bandwidth rather than (relatively slow) access times, beat SSDs, and
scale across multiple cores with hundreds of millions of entries.

http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes

How TokuDB Fractal Tree Databases Work

Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM

Time for someone to get started on ftrfs? Or can it be implemented
in Btrfs?
https://bugzilla.kernel.org/show_bug.cgi?id=43004
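As a rough illustration of the bandwidth-versus-access-time point above, here
is a small sketch using assumed typical figures for a 7200 rpm disk (~10 ms
per random seek, ~100 MB/s sequential transfer); these numbers are
illustrative, not measurements:

/* Rough illustration of why seek-bound access patterns waste a rotating
 * disk: assumed figures of ~10 ms per random seek and ~100 MB/s
 * sequential throughput, 4 KiB blocks. */
#include <stdio.h>

int main(void)
{
	const double seek_s = 0.010;		/* ~10 ms per random seek */
	const double seq_mb_per_s = 100.0;	/* sequential throughput */
	const double block_kb = 4.0;

	double random_mb_per_s = (1.0 / seek_s) * block_kb / 1024.0;

	printf("random 4K I/O: ~%.2f MB/s\n", random_mb_per_s);	/* ~0.39 */
	printf("sequential:    ~%.0f MB/s (~%.0fx faster)\n",
	       seq_mb_per_s, seq_mb_per_s / random_mb_per_s);	/* ~256x */
	return 0;
}

That gap is what write-optimized structures like fractal trees try to exploit
by turning many small random updates into fewer, larger sequential ones.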
On Wed, Mar 28, 2012 at 9:25 AM, Danny Piccirillo
<danny.piccirillo@member.fsf.org> wrote:
> The case has been made on Phoronix for fractal trees: they make use of
> hard drive bandwidth rather than (relatively slow) access times, beat
> SSDs, and scale across multiple cores with hundreds of millions of
> entries.
>
> http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
>
> How TokuDB Fractal Tree Databases Work
>
> Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM
>
> Time for someone to get started on ftrfs? Or can it be implemented
> in Btrfs?
> https://bugzilla.kernel.org/show_bug.cgi?id=43004

Whoa, very cool stuff. Fractals are awesome; cool to see them in use.
... 2010/11 -- surprised I never heard of it before now. Thanks for the
reference/links at the very least!

Aside: I once described fractals to my grandmother (100% devout Catholic)
as related to my own understanding of the universe -- specifically, I
pointed out how their often simple mathematical identity fragments into an
infinitely self-similar pattern of seemingly unbounded complexity -- and
she told me I was describing god ... and, well, we agreed :-)

-- C Anthony
On Wed, Mar 28, 2012 at 02:25:39PM +0000, Danny Piccirillo wrote:
> The case has been made on Phoronix for fractal trees: they make use of
> hard drive bandwidth rather than (relatively slow) access times, beat
> SSDs, and scale across multiple cores with hundreds of millions of
> entries.
>
> http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
>
> How TokuDB Fractal Tree Databases Work
>
> Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM
>
> Time for someone to get started on ftrfs? Or can it be implemented
> in Btrfs?
> https://bugzilla.kernel.org/show_bug.cgi?id=43004

This is dumber shit than usual out of Phoronix. I did some basic
calculations for worst-case scenarios. Let's say we max out btrfs's
8-level limit, so we have 7 levels of nodes and then our leaves. For
simplicity's sake, say we can fit 1 billion objects into this kind of
tree and each node/leaf can only hold 10 objects. (Keep in mind these
are just random numbers I'm pulling out of my ass and may not work out
right with such a large tree; just work with me here.)

Worst case, a binary search among 10 objects takes 4 comparisons, so
with a full 8-level b-tree we're doing 4 comparisons per level: 32
comparisons in the worst possible case to find our object.

Now consider a fractal tree with 1 billion objects. That would be a
29-row fractal tree (if I did my math right). I have to make some
assumptions about how you would implement a fractal tree here, but let's
assume every level records its min and max value to make it easier to
tell which level you need to search. So worst case your object is in the
29th level, and you have to read in 29 blocks and check the min/max
values to figure out which row to search. Let's be fair and say these
are in memory; we're still doing 29 comparisons right out of the gate
just to find the right row. Now we have the right row, but this is a
file system and the rows are split up into blocks, so we not only have
to binary search each block but also the blocks themselves to find the
right block with our object. And we can't do this in a directed way
either, because we have no way of indexing the individual blocks, so
worst case we're having to read in 268435456 blocks and then do a normal
binary search on _each_ block! Now, again to give fractal trees an added
benefit, say each block has its own min/max values so we only have to
binary search the one that will have our actual data; we're still
talking about reading in 33554432 times the number of blocks compared to
a b-tree!

And this doesn't even begin to touch the ability to handle
multi-threaded workloads. Imagine having to move all of these blocks
around in memory because your insert created a new row. You'd have to
lock each row as you move stuff around (at the very least), so if you
hit a particularly hot row your multithreaded performance is going to
look like you are running on an Atari 2600.

So no, I don't think it's time for someone to get started on ftrfs.

Thanks,

Josef
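A minimal sketch of the back-of-envelope arithmetic above, using the same toy
parameters (10 objects per node/block, roughly a billion objects) and the
same naive no-index-per-row layout; it is not a model of real btrfs or TokuDB
internals, and the block counts come out differently from the figures quoted
above because the exact block-size assumptions there aren't spelled out:

/* Rough reproduction of the toy arithmetic above: 10 objects per
 * node/block, ~1 billion objects, and a fractal layout with rows of
 * size 2^0, 2^1, ... and no per-row block index. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double objects = 1e9;
	const int per_block = 10;	/* toy fanout */
	const int btree_levels = 8;	/* btrfs's stated level limit */

	/* binary search among 10 objects: ceil(log2(10)) = 4 comparisons */
	int cmp_per_node = (int)ceil(log2(per_block));

	printf("b-tree worst case: %d comparisons\n",
	       btree_levels * cmp_per_node);		/* 32 */

	/* number of doubling rows needed for ~1e9 objects */
	int rows = (int)ceil(log2(objects));		/* 30 */
	/* with no block index inside a row, the biggest row has to be
	 * located block by block in the worst case */
	long long biggest_row_blocks =
		(long long)(pow(2, rows - 1) / per_block);

	printf("fractal rows: %d, blocks in the biggest row: %lld\n",
	       rows, biggest_row_blocks);
	return 0;
}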
On 03/28/2012 02:42 PM, Josef Bacik wrote:
> On Wed, Mar 28, 2012 at 02:25:39PM +0000, Danny Piccirillo wrote:
>> The case has been made on Phoronix for fractal trees: they make use of
>> hard drive bandwidth rather than (relatively slow) access times, beat
>> SSDs, and scale across multiple cores with hundreds of millions of
>> entries.
>
> So no, I don't think it's time for someone to get started on ftrfs.

Thanks for doing the research to back up the answer I was going to
respond with based on intuition. +1

-Jeff

-- 
Jeff Mahoney
SUSE Labs
On Wed, Mar 28, 2012 at 02:42:04PM -0400, Josef Bacik wrote:
> On Wed, Mar 28, 2012 at 02:25:39PM +0000, Danny Piccirillo wrote:
> > The case has been made on Phoronix for fractal trees: they make use
> > of hard drive bandwidth rather than (relatively slow) access times,
> > beat SSDs, and scale across multiple cores with hundreds of millions
> > of entries.
> >
> > http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
> >
> > How TokuDB Fractal Tree Databases Work
> >
> > Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM
> >
> > Time for someone to get started on ftrfs? Or can it be implemented
> > in Btrfs?
> > https://bugzilla.kernel.org/show_bug.cgi?id=43004
>
> This is dumber shit than usual out of Phoronix. I did some basic
> calculations for worst-case scenarios. Let's say we max out btrfs's
> 8-level limit, so we have 7 levels of nodes and then our leaves. For
> simplicity's sake, say we can fit 1 billion objects into this kind of
> tree and each node/leaf can only hold 10 objects. (Keep in mind these
> are just random numbers I'm pulling out of my ass and may not work out
> right with such a large tree; just work with me here.)
>
> Worst case, a binary search among 10 objects takes 4 comparisons, so
> with a full 8-level b-tree we're doing 4 comparisons per level: 32
> comparisons in the worst possible case to find our object.
>
> Now consider a fractal tree with 1 billion objects. That would be a
> 29-row fractal tree (if I did my math right). I have to make some
> assumptions about how you would implement a fractal tree here, but
> let's assume every level records its min and max value to make it
> easier to tell which level you need to search. So worst case your
> object is in the 29th level, and you have to read in 29 blocks and
> check the min/max values to figure out which row to search. Let's be
> fair and say these are in memory; we're still doing 29 comparisons
> right out of the gate just to find the right row. Now we have the
> right row, but this is a file system and the rows are split up into
> blocks, so we not only have to binary search each block but also the
> blocks themselves to find the right block with our object. And we
> can't do this in a directed way either, because we have no way of
> indexing the individual blocks, so worst case we're having to read in
> 268435456 blocks and then do a normal binary search on _each_ block!
> Now, again to give fractal trees an added benefit, say each block has
> its own min/max values so we only have to binary search the one that
> will have our actual data; we're still talking about reading in
> 33554432 times the number of blocks compared to a b-tree!

Ok, it looks like I wasn't being completely fair here; part of the
presentation talks about using forward pointers to point to where an
element would fall in the next row. So if we assume we're using forward
pointers and every row has a min/max indicator, you can walk down each
row, do a binary search to get as close as possible to what you want,
and then follow the forward pointer down to the next level. The problem
with this is that in the worst case you end up having to binary search
on each row, which makes the SMP case even crappier because you have to
make sure nothing changes while you are walking down the rows and
following the forward pointers.

The math here is a little trickier since I can't easily picture what the
worst-case scenario would be per level, but let's say O(log N/2), where
N is the number of elements in the row. So in the situation I describe
you are looking at a minimum of 29 reads, one for each row, plus all the
reads you need to binary search from your forward pointer to the end of
the row, which grows as you go down the tree. So probably not millions
of times more work than b-trees, but at least tens if not hundreds of
times more in the worst case.

Thanks,

Josef
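A minimal sketch of the search pattern being debated: binary search within
each row, with the window in the next row narrowed by forward pointers. The
entry/row layout and the forward-pointer convention (the index, in the next
row, of the first key greater than or equal to the entry's own key, or the
next row's length if there is none) are assumptions made here for
illustration, not TokuDB's actual on-disk format:

/* Toy model: rows of sorted entries, each entry carrying a forward
 * pointer into the next row.  Rows may be empty (count == 0). */
#include <stddef.h>

struct entry {
	long key;
	size_t fwd;	/* index in the next row of first key >= this key */
};

struct row {
	const struct entry *items;
	size_t count;
};

/* first index in items[lo, hi) whose key is >= target */
static size_t lower_bound(const struct entry *items, size_t lo, size_t hi,
			  long target)
{
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (items[mid].key < target)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Walk the rows top to bottom, using the forward pointers of the entries
 * bracketing the miss position to narrow the search window in the next row. */
int ft_search(const struct row *rows, size_t nr_rows, long target)
{
	size_t lo = 0;
	size_t hi = nr_rows ? rows[0].count : 0;

	for (size_t r = 0; r < nr_rows; r++) {
		size_t i = lower_bound(rows[r].items, lo, hi, target);

		if (i < rows[r].count && rows[r].items[i].key == target)
			return 1;			/* found */
		if (r + 1 == nr_rows)
			break;

		/* cascade the window down into the next row */
		lo = i ? rows[r].items[i - 1].fwd : 0;
		hi = (i < rows[r].count) ? rows[r].items[i].fwd + 1
					 : rows[r + 1].count;
		if (hi > rows[r + 1].count)
			hi = rows[r + 1].count;
	}
	return 0;
}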
> but let's say O(log N/2), where N is the number of elements in the row.
> So in the situation I describe you are looking at a minimum of 29
> reads, one for each row,

Hmm. Levels are powers of two and are either full or empty, so the total
item count tells you which levels are full or empty.

(gdb) print/t 1000000000
$3 = 111011100110101100101000000000

So no, definitely not reading 29 levels.

After realizing this, I have to be honest: I'm not finding your
hand-wavey dismissal of the technology convincing at all. :)

There's a *ton* of detail that they don't describe about how to actually
translate the logical notion of power-of-two node sizes into coherent
block IO that scales. I don't doubt that it's possible. I very much
doubt that the required complexity and cpu cost in the kernel would be
worth it for generic file system users, though. Hooo boy, do I doubt it.

- z
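The same observation in code form -- a toy model in which level k holds
exactly 2^k items and is either completely full or completely empty, so bit k
of the total item count says whether level k exists at all:

/* Toy model: level k holds exactly 2^k items and is full or empty, so
 * bit k of the total item count says whether level k needs searching. */
#include <stdio.h>

int main(void)
{
	unsigned long count = 1000000000UL;	/* ~1 billion items */
	int occupied = 0;

	for (int level = 0; level < 30; level++) {
		if (count & (1UL << level)) {
			printf("level %2d: %lu items\n",
			       level, 1UL << level);
			occupied++;
		}
	}
	/* 1e9 = 0b111011100110101100101000000000 -> 13 occupied levels */
	printf("%d of 30 levels are non-empty\n", occupied);
	return 0;
}

Thirteen set bits, so thirteen non-empty levels for a billion items, which is
where the "13 rows" figure in the next message comes from.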
On Wed, Mar 28, 2012 at 03:50:07PM -0400, Zach Brown wrote:
> > but let's say O(log N/2), where N is the number of elements in the
> > row. So in the situation I describe you are looking at a minimum of
> > 29 reads, one for each row,
>
> Hmm.
>
> Levels are powers of two and are either full or empty, so the total
> item count tells you which levels are full or empty.
>
> (gdb) print/t 1000000000
> $3 = 111011100110101100101000000000
>
> So no, definitely not reading 29 levels.

You are still going to need at least 29 levels to accommodate 1 billion
objects, though they won't all be full (sorry, I missed the
must-be-full-or-empty bit). So it looks like we'll actually have to
search, what, 13 rows? Still more rows than a b-tree, and again you are
talking about binary searching each row's blocks and then following its
forward pointer down to the next one. So maybe not hundreds of times
slower than a b-tree, but we're not talking about a 5-10% difference
either; probably still measured in multiples.

> After realizing this, I have to be honest: I'm not finding your
> hand-wavey dismissal of the technology convincing at all. :)

Hah, my math was a little off, I'll grant you, but the basic idea still
stands: once we start getting into larger and larger rows, we're going
to have to binary search just to find the right _block_ to use, which in
my mind is much more of a problem than having to binary search inside of
multiple blocks. At the very least, as the number of objects grows, the
time it takes to find something in the tree grows much faster than it
would in a b-tree.

> There's a *ton* of detail that they don't describe about how to
> actually translate the logical notion of power-of-two node sizes into
> coherent block IO that scales. I don't doubt that it's possible.

I imagine there is, but based on what little information they've shown I
don't see how it's a hands-down win against b-trees. If anything, we're
talking about having to solve really complex problems in order to get
any sort of good performance out of this thing.

Thanks,

Josef
> I imagine there is, but based on what little information they've shown
> I don't see how it's a hands-down win against b-trees. If anything,
> we're talking about having to solve really complex problems in order
> to get any sort of good performance out of this thing.

Oh, absolutely. Tack on COW and online repair and the complexity goes
through the roof.

- z
> You are still going to need at least 29 levels to accommodate 1 billion
> objects, though they won't all be full (sorry, I missed the
> must-be-full-or-empty bit). So it looks like we'll actually have to
> search, what, 13 rows? Still more rows than a b-tree, and again you are
> talking about binary searching each row's blocks and then following its
> forward pointer down to the next one. So maybe not hundreds of times
> slower than a b-tree, but we're not talking about a 5-10% difference
> either; probably still measured in multiples.

They say that they can do the block pointers in a way that limits the
number of searches per row to a constant, so it's not that bad. They do
fewer random seeks, but bigger ones, at the cost of more cpu usage.

> Hah, my math was a little off, I'll grant you, but the basic idea still
> stands: once we start getting into larger and larger rows, we're going
> to have to binary search just to find the right _block_ to use, which
> in my mind is much more of a problem than having to binary search
> inside of multiple blocks. At the very least, as the number of objects
> grows, the time it takes to find something in the tree grows much
> faster than it would in a b-tree.

That's not what they say. A lot of info is missing, though.

>> There's a *ton* of detail that they don't describe about how to
>> actually translate the logical notion of power-of-two node sizes into
>> coherent block IO that scales. I don't doubt that it's possible.
>
> I imagine there is, but based on what little information they've shown
> I don't see how it's a hands-down win against b-trees. If anything,
> we're talking about having to solve really complex problems in order to
> get any sort of good performance out of this thing. Thanks,

They don't claim to win hands down for the search case; they say they
are similar for searches but much better at inserts.

I don't think it's good for the linux fs general use case, but it's not
as bad as you think. It will be very hard to implement anyway unless
they release more details, though.

Niels
On Wed, Mar 28, 2012 at 10:44:03PM +0200, Niels de Carpentier wrote:
> > You are still going to need at least 29 levels to accommodate 1
> > billion objects, though they won't all be full (sorry, I missed the
> > must-be-full-or-empty bit). So it looks like we'll actually have to
> > search, what, 13 rows? Still more rows than a b-tree, and again you
> > are talking about binary searching each row's blocks and then
> > following its forward pointer down to the next one. So maybe not
> > hundreds of times slower than a b-tree, but we're not talking about
> > a 5-10% difference either; probably still measured in multiples.
>
> They say that they can do the block pointers in a way that limits the
> number of searches per row to a constant, so it's not that bad. They
> do fewer random seeks, but bigger ones, at the cost of more cpu usage.

I'd like to see how they do that. The fact is you are still going to get
random seeks, since you have to binary search the blocks in an entire
row -- there is no way you can read a several-thousand-block row into
memory to search it -- so once your rows get pretty big you are doing
just as much if not more random io as a b-tree.

> > Hah, my math was a little off, I'll grant you, but the basic idea
> > still stands: once we start getting into larger and larger rows,
> > we're going to have to binary search just to find the right _block_
> > to use, which in my mind is much more of a problem than having to
> > binary search inside of multiple blocks. At the very least, as the
> > number of objects grows, the time it takes to find something in the
> > tree grows much faster than it would in a b-tree.
>
> That's not what they say. A lot of info is missing, though.
>
> >> There's a *ton* of detail that they don't describe about how to
> >> actually translate the logical notion of power-of-two node sizes
> >> into coherent block IO that scales. I don't doubt that it's
> >> possible.
>
> > I imagine there is, but based on what little information they've
> > shown I don't see how it's a hands-down win against b-trees. If
> > anything, we're talking about having to solve really complex
> > problems in order to get any sort of good performance out of this
> > thing. Thanks,
>
> They don't claim to win hands down for the search case; they say they
> are similar for searches but much better at inserts.
>
> I don't think it's good for the linux fs general use case, but it's
> not as bad as you think. It will be very hard to implement anyway
> unless they release more details, though.

I don't buy that it's better in the insert case either, for similar
reasons: you are talking about having to rewrite entire rows to maintain
the sequential nature of everything, so if you end up filling a giant
row you have to go and write the whole thing out again. That's a lot
more IO than with b-trees, albeit sequential. So maybe it's a win since
it's sequential, but I wonder at what tree size that benefit no longer
exists.

Of course, all this analysis is based on a PowerPoint presentation and
there may be some detail I'm missing, but that's mostly my point:
claiming that b-trees are now obsolete because we can maybe do inserts
faster is just idiotic.

Thanks,

Josef
> I'd like to see how they do that. The fact is you are still going to
> get random seeks, since you have to binary search the blocks in an
> entire row -- there is no way you can read a several-thousand-block
> row into memory to search it -- so once your rows get pretty big you
> are doing just as much if not more random io as a b-tree.

Why? The rows are sequential on disk, so you're never doing more random
seeks than the number of rows, as long as you can search faster than the
disk can provide data. If they can indeed limit the number of searches
per row to a constant, it shouldn't be an issue at all.

> I don't buy that it's better in the insert case either, for similar
> reasons: you are talking about having to rewrite entire rows to
> maintain the sequential nature of everything, so if you end up filling
> a giant row you have to go and write the whole thing out again. That's
> a lot more IO than with b-trees, albeit sequential. So maybe it's a win
> since it's sequential, but I wonder at what tree size that benefit no
> longer exists.

The worst case might be bad, but on average it should be good. I don't
doubt it is always better on rotating disks; probably not on SSDs.

> Of course, all this analysis is based on a PowerPoint presentation and
> there may be some detail I'm missing, but that's mostly my point:
> claiming that b-trees are now obsolete because we can maybe do inserts
> faster is just idiotic.

It's a presentation from a company, so lots of marketing. Of course it
doesn't make b-trees obsolete; it's optimized for specific cases. Like
they say, they reduce random seeks at the cost of disk bandwidth and cpu
usage. It depends on the use case whether that is useful or not.
Probably not for btrfs, since the future will be SSDs and metadata will
generally be cached.

Niels
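A toy sketch of the insert behaviour being argued about: levels of doubling
size that are either full or empty, where an insert that hits a full level
merges downward like a carry in binary addition. This is only a model of the
doubling-row scheme the thread describes (a cache-oblivious lookahead array),
with allocation error handling omitted; it is not how the actual fractal-tree
implementation buffers and batches its writes:

/* Toy lookahead-array insert: level[k] is either NULL or a sorted array
 * of exactly 2^k keys.  The struct is assumed to start zeroed. */
#include <stdlib.h>

#define MAX_LEVELS 30

struct cola {
	long *level[MAX_LEVELS];
};

/* Merge two sorted arrays of len keys each into a freshly allocated
 * array of 2 * len keys. */
static long *merge(const long *a, const long *b, size_t len)
{
	long *out = malloc(2 * len * sizeof(*out));
	size_t i = 0, j = 0, k = 0;

	while (i < len && j < len)
		out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
	while (i < len)
		out[k++] = a[i++];
	while (j < len)
		out[k++] = b[j++];
	return out;
}

/* Insert one key: carry the merged run downward until an empty level is
 * found.  Most inserts stop in the first few (small) levels; the worst
 * case merges all the way down and rewrites the largest occupied level. */
void cola_insert(struct cola *c, long key)
{
	size_t len = 1;			/* run size: 2^k at level k */
	long *run = malloc(sizeof(*run));

	run[0] = key;
	for (int k = 0; k < MAX_LEVELS; k++, len *= 2) {
		if (!c->level[k]) {
			c->level[k] = run;	/* empty slot: done */
			return;
		}
		long *merged = merge(c->level[k], run, len);

		free(c->level[k]);
		free(run);
		c->level[k] = NULL;		/* level k empties out */
		run = merged;			/* carry into level k + 1 */
	}
	free(run);	/* tree is full in this toy model; key is dropped */
}

Most inserts only touch the small top levels, which is where the claim of
amortized sequential writes comes from; the rare carry all the way down is
the worst case discussed above, where the biggest occupied level gets
rewritten in one long sequential pass.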
Niels de Carpentier <niels <at> decarpentier.com> writes:

> > I'd like to see how they do that. The fact is you are still going to
> > get random seeks, since you have to binary search the blocks in an
> > entire row -- there is no way you can read a several-thousand-block
> > row into memory to search it -- so once your rows get pretty big you
> > are doing just as much if not more random io as a b-tree.
>
> Why? The rows are sequential on disk, so you're never doing more random
> seeks than the number of rows, as long as you can search faster than
> the disk can provide data. If they can indeed limit the number of
> searches per row to a constant, it shouldn't be an issue at all.
>
> > I don't buy that it's better in the insert case either, for similar
> > reasons: you are talking about having to rewrite entire rows to
> > maintain the sequential nature of everything, so if you end up
> > filling a giant row you have to go and write the whole thing out
> > again. That's a lot more IO than with b-trees, albeit sequential. So
> > maybe it's a win since it's sequential, but I wonder at what tree
> > size that benefit no longer exists.
>
> The worst case might be bad, but on average it should be good. I don't
> doubt it is always better on rotating disks; probably not on SSDs.
>
> > Of course, all this analysis is based on a PowerPoint presentation
> > and there may be some detail I'm missing, but that's mostly my point:
> > claiming that b-trees are now obsolete because we can maybe do
> > inserts faster is just idiotic.
>
> It's a presentation from a company, so lots of marketing. Of course it
> doesn't make b-trees obsolete; it's optimized for specific cases. Like
> they say, they reduce random seeks at the cost of disk bandwidth and
> cpu usage. It depends on the use case whether that is useful or not.
> Probably not for btrfs, since the future will be SSDs and metadata will
> generally be cached.
>
> Niels

I'm reviving a very old thread from here:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/16484

I found a bunch of more recent info from the company doing most of the
work behind fractal-tree file systems that I'd love to hear your
thoughts on:

https://www.youtube.com/watch?v=c-n2LGPpQEw
http://www.tokutek.com/2013/02/mongodb-fractal-tree-indexes-high-compression
https://www.usenix.org/conference/hotstorage12/tokufs-streaming-file-system

Does this address the earlier concerns at all? I just hope their work
will be released as unpatented, GPL-compatible free software.