Dear list,

I have a big fundamental question: if the load I'll put on the FS is
more IOPS-intensive than throughput-intensive (because I'll be accessing
lots of medium-sized files, ~5 MB, from a small number of clients), am I
better off with Lustre or PVFS2?

Also, if the main load is IOPS, shouldn't I oversize the MDS/MDT in
terms of CPU/RAM and storage performance (i.e. as many 15K SAS RAID10
spindles as possible)?

On the budget side, may I use asynchronous DRBD to mirror the MDT
(internal storage), or should I instead get good shared storage
(direct-attached or iSCSI)?

Today I'm leaning towards Lustre, because I've tested it against
GlusterFS: Gluster performed slightly worse than Lustre overall, but it
failed the bonnie++ create/delete tests badly.  I haven't given PVFS2 a
shot yet...

Thank you. And yes, I do not intend to start a flame war, just to gain a
better understanding of which FS best suits our needs.
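For anyone wanting to reproduce that comparison, a minimal metadata-only
bonnie++ run might look like the sketch below; the mount point, file
counts, and user are illustrative assumptions, not the exact parameters
I used.

  #!/bin/sh
  # Metadata-only bonnie++ run: -s 0 skips the block/char throughput
  # phase, so only the sequential and random create/stat/delete tests
  # execute.
  # -n 128:4096:1024:16 creates 128*1024 files of 1-4 kB in 16 dirs.
  # The mount point and user are placeholders; adjust to your setup.
  bonnie++ -d /mnt/testfs -s 0 -n 128:4096:1024:16 -u nobody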
On 2009-12-02, at 09:20, Francois Chassaing wrote:
> I have a big fundamental question: if the load I'll put on the FS is
> more IOPS-intensive than throughput-intensive (because I'll be accessing
> lots of medium-sized files, ~5 MB, from a small number of clients), am I
> better off with Lustre or PVFS2?

I don't think PVFS2 is necessarily better at IOPS than Lustre.  This is
mostly dependent upon the storage configuration.

> Also, if the main load is IOPS, shouldn't I oversize the MDS/MDT in
> terms of CPU/RAM and storage performance (i.e. as many 15K SAS RAID10
> spindles as possible)?

The Lustre MDS/MDT is involved only in file lookup/open/close, not in the
actual IO operations.  Still, in your case that means the MDS gets 2 RPCs
(open + close, which can be handled asynchronously in memory) for every
5 OST RPCs (a 5 MB read/write at the default 1 MB bulk RPC size, which
happens synchronously), so the MDS will definitely need to scale, though
not necessarily to 2/5 of the aggregate OST capability.

Typical numbers for a high-end MDT node (16 cores, 64 GB of RAM, DDR IB)
are about 8-10k creates/sec and up to 20k lookups/sec from many clients.

Depending on the number of files you are planning to have in the
filesystem, I would suggest SSDs for the MDT filesystem, especially if
you have a large working set and are doing read-mostly access.

> On the budget side, may I use asynchronous DRBD to mirror the MDT
> (internal storage), or should I instead get good shared storage
> (direct-attached or iSCSI)?

Some people on this list have used DRBD, but we haven't tested it
ourselves.  I _suspect_ (though have not tested this) that with DRBD you
could put lower-performance storage on the backup server without
significantly impacting the primary server's performance, if you are
willing to run slower in the rare case when you have failed over to the
backup.

> Today I'm leaning towards Lustre, because I've tested it against
> GlusterFS: Gluster performed slightly worse than Lustre overall, but it
> failed the bonnie++ create/delete tests badly.  I haven't given PVFS2 a
> shot yet...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
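For the asynchronous-DRBD option discussed above, a minimal resource
definition might look like the sketch below (DRBD 8.x syntax, with
protocol A being the asynchronous mode; the hostnames, devices, and
addresses are placeholders, not a tested configuration):

  # /etc/drbd.d/mdt.res -- illustrative only; hosts, disks, and IPs
  # below are placeholders, not a tested configuration.
  resource mdt {
    protocol A;                    # A = asynchronous replication
    on mds1 {
      device    /dev/drbd0;        # the device Lustre formats as the MDT
      disk      /dev/sdb1;         # fast local RAID10 on the primary
      address   192.168.10.1:7789;
      meta-disk internal;
    }
    on mds2 {
      device    /dev/drbd0;
      disk      /dev/sdc1;         # may be slower storage on the backup
      address   192.168.10.2:7789;
      meta-disk internal;
    }
  }

With protocol A, a write is acknowledged once it has reached the
primary's disk and the local TCP send buffer, which is what lets a
slower secondary avoid throttling the primary, at the cost of possibly
losing the last in-flight writes on a failover.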
Andreas Dilger wrote:
> On 2009-12-02, at 09:20, Francois Chassaing wrote:
>> Also, if the main load is IOPS, shouldn't I oversize the MDS/MDT in
>> terms of CPU/RAM and storage performance (i.e. as many 15K SAS RAID10
>> spindles as possible)?
>
> [...]
>
> Typical numbers for a high-end MDT node (16 cores, 64 GB of RAM, DDR IB)
> are about 8-10k creates/sec and up to 20k lookups/sec from many clients.
>
> Depending on the number of files you are planning to have in the
> filesystem, I would suggest SSDs for the MDT filesystem, especially if
> you have a large working set and are doing read-mostly access.

Andreas,

Has anyone reported results of an SSD-based MDT?

Craig

--
Craig Tierney (craig.tierney at noaa.gov)
On 2009-12-02, at 12:15, Craig Tierney wrote:
> Andreas Dilger wrote:
>> Typical numbers for a high-end MDT node (16 cores, 64 GB of RAM, DDR IB)
>> are about 8-10k creates/sec and up to 20k lookups/sec from many clients.
>>
>> Depending on the number of files you are planning to have in the
>> filesystem, I would suggest SSDs for the MDT filesystem, especially if
>> you have a large working set and are doing read-mostly access.
>
> Has anyone reported results of an SSD-based MDT?

We have done internal testing, and the performance for many workloads is
somewhat faster, but not a TON faster.  This is because Lustre is already
doing async IO on the MDS, unlike NFS, so decent streaming IO performance
and lots of RAM meet many of the create/lookup performance targets.

If you have a huge filesystem doing a lot of random lookup, create, and
unlink operations (i.e. the working set is larger than the MDS RAM, at
about 4 kB per file for random operations, that is 16M files on a 64 GB
MDS), then the high IOPS rate of SSDs will definitely make a huge
difference: keeping 20k lookups/sec over DDR IB instead of falling to
roughly mdt_disks * 100 (the seek-bound rate of rotating disks).

Since that isn't a common workload for our customers, we haven't done a
lot of testing in that area, but it is definitely something I'm curious
about.
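To make that working-set arithmetic concrete, here is a back-of-envelope
sketch; the 4 kB-per-file figure is the estimate above, and the 64 GB of
RAM is just the example node's, not a recommendation.

  #!/bin/sh
  # Rough check: how many files' worth of metadata fit in MDS RAM?
  # Assumes ~4 kB of cached metadata per file (estimate from above).
  ram_gib=64
  bytes_per_file=4096
  files_cached=$(( ram_gib * 1024 * 1024 * 1024 / bytes_per_file ))
  echo "a ${ram_gib} GiB MDS caches roughly ${files_cached} files"  # ~16.8M

If the number of files you touch at random is well past that figure, the
workload becomes seek-bound on a rotating-disk MDT, which is exactly the
case where SSDs pay off.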
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.