Carlson, Timothy S
2011-May-16 18:58 UTC
[Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?
Folks,

I know that flash-based technology gets talked about from time to time on the list, but I was wondering if anybody has actually implemented FusionIO devices for metadata. The last thread I can find on the mailing list that relates to this topic dates from 3 years ago. The software driving the Fusion cards has come quite a ways since then, and I've got good experience using the device as a raw disk.

I'm just fishing around to see if anybody has implemented one of these devices in a reasonably sized Lustre config, where "reasonably" is left open to interpretation. I'm thinking >500T and a few million files.

Thanks!

Tim
Dardo D Kleiner - CONTRACTOR
2011-May-17 16:26 UTC
[Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?
Short answer: of course it works - they're just block devices after all - but you'll find that you won't realize the performance gains you might expect (at least not for an MDT).

Aside from simply being fast OSTs, there are several areas that would allow Lustre to take advantage of these kinds of devices:

1) SMP scaling for the MDS - the problem right now is that the low latency of these devices really shines best when you have many threads scattering small I/O. The current (1.8.x) Lustre MDS doesn't do this.
2) Flashcache/bcache over traditional disk storage (OST or MDT) - this can be done today, of course. There are some interop issues in my testing, but when it works it does what it says it does. It still won't really help an MDT though.
3) Targeted device mapping of the metadata portions of an OST on traditional disk (e.g. extent lists) onto flash.

#1 is substantial work (ongoing, I believe). #2 is pretty nifty: it basically grows your local page cache beyond RAM, which helps when the "hot" working set is large. #3 is trickier, and though I haven't tried it, I understand there's real effort ongoing in this regard.

Filesystem size in this discussion is mostly irrelevant for an MDT; it's just whether or not the device is big enough for the number of objects (a few million is *not* many). A huge number of clients thrashing about creating/modifying/deleting is where these things have the most potential.

- Dardo
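To put a rough number on the sizing point above (MDT capacity is driven by object count, not by filesystem size), here is a minimal Python sketch. The function name and the 4096 bytes-per-inode ratio are assumptions for illustration, not values taken from this thread; substitute whatever ratio the MDT was actually formatted with.

```python
def mdt_bytes_needed(num_objects: int, bytes_per_inode: int = 4096) -> int:
    """Device space consumed when one inode is allocated per bytes_per_inode bytes.

    bytes_per_inode is an ASSUMED formatting ratio, not a figure from the thread.
    """
    if num_objects < 0:
        raise ValueError("num_objects must be non-negative")
    if bytes_per_inode <= 0:
        raise ValueError("bytes_per_inode must be positive")
    return num_objects * bytes_per_inode


if __name__ == "__main__":
    # Hypothetical object counts, chosen only to show the scale.
    for objects in (5_000_000, 100_000_000):
        gib = mdt_bytes_needed(objects) / 2**30
        print(f"{objects:>11,} objects -> about {gib:.1f} GiB of MDT space")
```

Under that assumed ratio, a few million objects need only a few tens of gigabytes, which is consistent with the remark that a few million is not many.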
Kevin Van Maren
2011-May-19 16:28 UTC
[Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?
Dardo D Kleiner - CONTRACTOR wrote:
> Short answer: of course it works - they're just block devices after all - but you'll find that you won't realize the performance gains you might expect (at least not for an MDT).

Yes. See the email thread "improving metadata performance" and Robin Humble's talk at LUG. The MDT disk is rarely the bottleneck (although that could change with full size-on-mds support), which others had discovered using a ram-based (tmpfs) MDT.

As for putting the entire filesystem on flash, sure that would be pretty nifty, but expensive. Not being able to do failover, with storage on internal PCIe cards, is a downside.

> Aside from simply being fast OSTs, there are several areas that would allow Lustre to take advantage of these kinds of devices:
>
> 1) SMP scaling for the MDS - the problem right now is that the low latency of these devices really shines best when you have many threads scattering small I/O. The current (1.8.x) Lustre MDS doesn't do this.

SMP scaling is a big issue. In Lustre 1.8.x the MDT reaches its maximum performance at no more than 8 CPUs (maybe fewer); additional CPU cores result in _lower_ performance. There are patches for Lustre 2.x to improve SMP scaling, but I haven't tested a workload.

> 2) Flashcache/bcache over traditional disk storage (OST or MDT) - this can be done today, of course. There's some interop issues in my testing, but when it works it does what it says it does. It still won't really help an MDT though.
> 3) Targeted device mapping of the metadata portions of an OST on traditional disk (e.g. extent lists) onto flash.
>
> #1 is substantial work (ongoing I believe). #2 is pretty nifty, basically grow your local page cache beyond RAM - helps when "hot" working set is large. #3 is trickier and though I haven't tried it I understand there's real effort ongoing in this regard.

flex_bg is in ext4, which allows the inodes to be packed together.
Andreas Dilger
2011-May-19 18:44 UTC
[Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?
On May 19, 2011, at 10:28, Kevin Van Maren wrote:
> Dardo D Kleiner - CONTRACTOR wrote:
>> Short answer: of course it works - they're just block devices after all - but you'll find that you won't realize the performance gains you might expect (at least not for an MDT).
>
> Yes. See the email thread "improving metadata performance" and Robin Humble's talk at LUG. The MDT disk is rarely the bottleneck (although that could change with full size-on-mds support), which others had discovered using a ram-based (tmpfs) MDT.

I will assert that MDT disk performance is rarely the bottleneck only for filesystem-modifying operations, because the seek latency is largely hidden by the linear IO of the journal, and because most metadata benchmarks are done on test filesystems that are empty (i.e. free inodes are all contiguous). I think that for real-world usage on aged filesystems, and/or cold-cache operations (just mounted, or larger than can fit in RAM), SSD can help significantly.

> As for putting the entire filesystem on flash, sure that would be pretty nifty, but expensive. Not being able to do failover, with storage on internal PCIe cards, is a downside.

I doubt this will be possible for a long time to come, due to cost, even if the PCI cards have external interfaces (as I've heard some high-end ones do).

> flex_bg is in ext4, which allows the inodes to be packed together.

As an FYI, a patch to enable flex_bg (and other ext4 features) by default was just landed to the master branch for 2.1. It also reduces the number of inodes created on large OSTs (i.e. pretty much any new OST), and increases the number of inodes created on the MDT. That is more in line with typical users of Lustre today, and testing so far has shown that flex_bg reduces mke2fs and e2fsck time noticeably. The higher MDT inode ratio is also helpful for flash users, since it more efficiently uses the space on the MDT.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
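A back-of-the-envelope model of the cold-cache case described above: if each inode lookup that misses RAM costs one random read, total time scales directly with device latency. The sketch below is only that, a sketch. The latency figures (8000 microseconds for a disk seek, 100 microseconds for a flash read) and the function name are illustrative assumptions, not measurements from this thread, and the model ignores readahead, journaling, and concurrency.

```python
def cold_lookup_seconds(num_inodes: int, read_latency_us: int,
                        cache_hit_percent: int = 0) -> float:
    """Seconds spent on random reads when uncached lookups cost one read each.

    All inputs are ASSUMED values supplied by the caller for illustration.
    """
    if not 0 <= cache_hit_percent <= 100:
        raise ValueError("cache_hit_percent must be between 0 and 100")
    if num_inodes < 0 or read_latency_us < 0:
        raise ValueError("num_inodes and read_latency_us must be non-negative")
    # Integer arithmetic keeps the miss count exact before converting to seconds.
    misses = num_inodes * (100 - cache_hit_percent) // 100
    return misses * read_latency_us / 1_000_000


if __name__ == "__main__":
    inodes = 1_000_000  # hypothetical number of inodes touched after a fresh mount
    for name, latency_us in (("disk (assumed 8000 us)", 8000),
                             ("flash (assumed 100 us)", 100)):
        secs = cold_lookup_seconds(inodes, latency_us)
        print(f"{name}: {secs:,.0f} s for {inodes:,} cold lookups")
```

With a warm cache (cache_hit_percent near 100) the two devices converge, which is the same trade-off as spending on MDS memory instead of flash.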
Carlson, Timothy S
2011-May-19 19:45 UTC
[Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?
> On May 19, 2011, at 10:28, Kevin Van Maren wrote:
> > Dardo D Kleiner - CONTRACTOR wrote:
> > As for putting the entire filesystem on flash, sure that would be pretty nifty, but expensive. Not being able to do failover, with storage on internal PCIe cards, is a downside.
>
> [Andreas added this comment]
> I doubt this will be possible for a long time to come, due to cost, even if the PCI cards have external interfaces (as I've heard some high-end ones do).

I hate to snip out most of a thread, but I want to focus on the issues of cost and failover.

As for cost, I really don't think this is an issue. If I am investing in a file system that is either approaching a petabyte or is larger than a petabyte, then I don't see that purchasing a 5K-10K flash device is really a cost factor. It is not quite in the noise, but it is going to be less than 5% of the total purchase price of the file system.

Failover is an issue. I've been keeping some loose statistics on my current Lustre configurations (a petabyte or so in total) and looking at what components fail and where redundancy/failover could be improved. So far, metadata server failure hasn't entered the picture. The problem with Lustre is it is now just too damn robust to random reboots :).

Thanks
Tim
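The cost argument above can be turned around into a quick check: for a flash device to stay under a given share of the budget, the total purchase price has to exceed a threshold. A small Python sketch follows; the function name is hypothetical, and the only inputs are the 5K-10K device price and the 5% share quoted above.

```python
def min_total_price(device_price: float, max_share_percent: float) -> float:
    """Total price at which device_price equals max_share_percent of the total.

    Any total above this value keeps the device under the given share.
    """
    if device_price < 0:
        raise ValueError("device_price must be non-negative")
    if max_share_percent <= 0:
        raise ValueError("max_share_percent must be positive")
    return device_price * 100 / max_share_percent


if __name__ == "__main__":
    for price in (5_000, 10_000):
        threshold = min_total_price(price, 5)
        print(f"A {price:,} device is under 5% once the system costs more than {threshold:,.0f}")
```

So the 5% figure holds for any system costing more than 200K (in the same currency units) at the top of the quoted device price range.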
David Dillow
2011-May-19 20:45 UTC
[Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?
On Thu, 2011-05-19 at 12:45 -0700, Carlson, Timothy S wrote:
> > On May 19, 2011, at 10:28, Kevin Van Maren wrote:
> > > Dardo D Kleiner - CONTRACTOR wrote:
> > > As for putting the entire filesystem on flash, sure that would be pretty nifty, but expensive. Not being able to do failover, with storage on internal PCIe cards, is a downside.
> >
> > [Andreas added this comment]
> > I doubt this will be possible for a long time to come, due to cost, even if the PCI cards have external interfaces (as I've heard some high-end ones do).
>
> I hate to snip out most of a thread, but I want to focus on the issues of cost and failover.
>
> As for cost, I really don't think this is an issue. If I am investing in a file system that is either approaching a Petabyte or is larger than a Petabyte then I don't see that purchasing a 5K-10K flash device is really a cost factor. It is not quite in the noise, but it is going to be less than 5% of the total purchase price of a the file system.

Uhm, they are talking about building the entire file system out of flash. Have you priced petabytes of flash lately? There are small countries you could buy for less money... ;)

Buying flash for just the MDT would be in the 5% range you mention, but you only get better cache-cold performance. Better to spend the money on more memory for your MDS, and avoid downtime, IMO.

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office