Hi, we would like to set up a small Lustre instance. For the OSTs we are planning to use standard Dell PE1950 servers (2x quad-core + 16 GB RAM), with the disks in a JBOD (MD1000) driven by the PE1950's internal RAID controller (RAID-6). Any experience (good or bad) with such a config? Thanks, Martin
After I got the kinks worked out, this setup FLIES, especially over InfiniBand. It's actually what I'm running for a 50 TB Lustre setup. I would strongly recommend either two 7-disk RAID 5s with a hot spare, or one RAID 6, per MD1000; I don't know what the performance difference between the two is like. I also believe ldiskfs will now allow you to format a single partition larger than 8 TB. If that is not the case, go with the two smaller RAID 5s. If you use LVM and split a single physical partition into two virtual partitions, your performance will suffer. Also, don't bother partitioning the RAIDs; use the raw block devices (i.e. /dev/sdX).

I've also seen significantly better performance out of RHEL5 than RHEL4, but that was with the PERC 5, so I can't speak for the PERC 6. The key (at least with the PERC 5s) can be found in this article: http://thias.marmotte.net/archives/2008/01/05/Dell-PERC5E-and-MD1000-performance-tweaks.html. It makes a WORLD of difference.

Good luck!

-Aaron

PS: I'd also avoid putting more than 3 MD1000s per 1950, for bandwidth reasons.

----- Original Message -----
From: "Martin Gasthuber" <martin.gasthuber at desy.de>
To: "Lustre" <lustre-discuss at clusterfs.com>
Sent: Wednesday, March 26, 2008 7:53:31 AM GMT -05:00 US/Canada Eastern
Subject: [Lustre-discuss] HW experience

> Hi, we would like to establish a small Lustre instance and for the OST
> planning to use standard Dell PE1950 servers (2x QuadCore + 16 GB Ram) and
> for the disk a JBOD (MD1000) steered by the PE1950 internal Raid controller
> (Raid-6). Any experience (good or bad) with such a config?
>
> thanxs, Martin
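For reference, the read-ahead tweak referred to above (Kilian mentions it again further down) is a plain block-device setting, and Aaron's "raw block device" advice just means running mkfs.lustre on the whole, unpartitioned device. A rough sketch, where the device name, read-ahead value, scheduler choice, file system name and MGS NID are illustrative placeholders rather than anyone's actual configuration:

  # larger read-ahead on the RAID volume (value is in 512-byte sectors)
  blockdev --setra 8192 /dev/sdb

  # the deadline elevator tends to behave better than the default cfq for streaming OST I/O
  echo deadline > /sys/block/sdb/queue/scheduler

  # allow more outstanding requests so the controller stays busy
  echo 512 > /sys/block/sdb/queue/nr_requests

  # format the unpartitioned device as an OST, as suggested above
  mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.1.1@tcp0 /dev/sdb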
On Wed, 26 Mar 2008, Martin Gasthuber wrote:
Hi,
I have a Lustre setup with two OSSes and 8 MD1000s connected to two PERC 5/E controllers, with ~90 TB of raw storage. The MDS's 6 drives are on a separate PERC 5/i controller. It uses bonded gigabit Ethernet, and performance is not bad. I have another Lustre setup with a single OSS and a 3ware 9650; its performance is better than the first setup's. If the new PERC 6 is based on the LSI 8888 (I could be wrong on this assumption, but it does support RAID 6), then the performance of the PERC 5 and PERC 6 is not directly comparable: one is an Intel IOP based controller, while the 8888 is a PowerPC based one. Some benchmarks on the internet list the 8888 as a very good controller with some of the best performance available. So if the PERC 6 is based on the 8888, it could well be better than the PERC 5.
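The bonded gigabit Ethernet mentioned above is standard Linux bonding; on RHEL 4/5 of that era it would look roughly like the following, where the bonding mode, interface names and addresses are illustrative assumptions, not Balagopal's actual settings:

  # /etc/modprobe.conf
  alias bond0 bonding
  options bond0 mode=balance-alb miimon=100
  # tell LNET to use the bonded interface for the tcp network
  options lnet networks=tcp0(bond0)

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  IPADDR=192.168.1.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (eth1 is analogous)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none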
If there is only one MD1000, then you should look at the performance of split mode, with one LUN per RAID 5 (or better, RAID 6 for peace of mind) and one hot spare (in the case of RAID 6). PERC 6 was my first choice over PERC 5, just to get RAID 6, but it was not out yet when the equipment was purchased.

The MD1000s that I have here have Seagate Barracuda ES drives, which are supposed to be enterprise hard drives, but SATA drives do have a high failure rate: I have had 6 fail in the last 4-5 months (out of 120 in total), which is not that high, especially when the arrays are almost full. PERC 5 does have some weird characteristics. For example, a month ago a drive in one MD1000 enclosure started showing "unexpected sense" errors. Unfortunately it started happening twice or thrice per second, and the monitoring daemon sent an email for every instance, as I had configured the alert level too high. Finally, when the drive failed, one more drive dropped out of another volume for no apparent reason; Dell is also unable to explain that. I was fortunate that it didn't drop out of the same RAID 5 volume as the actually failing drive. The dropped drive is still a good drive and is now in the global hot-spare pool. In hindsight it was good that I didn't go for a single big RAID 5 volume with 13 or 14 drives, and instead went for two RAID 5 volumes per enclosure (7 + 6, plus two hot spares), precisely because of this scenario: it reduces the risk of multiple simultaneous drive failures in the same volume. But as I mentioned before, if the PERC 6 is based on the LSI 8888, then it is a totally new product, not directly comparable to the PERC 5.
Also watch out for some critical Lustre bugs which are showstoppers, like this one: https://bugzilla.lustre.org/show_bug.cgi?id=13438. Until that patch came out, the two OSSes crashed on a daily basis for two months. There is also a bug affecting NFS exports that requires a reboot of the Lustre client doing the NFS export; I have worked around it for now by using a virtual machine for the NFS export of the Lustre volumes, so that a reboot won't affect running compute jobs (sketched below). There is also the problem, explained in an email on the list a few days ago, of clients getting evicted. So although I was concentrating much more on the MD1000s in the beginning, in the end I was more than happy when I got a working, stable Lustre configuration, and I'm now not too keen to extract the last ounce of performance out of the MD1000 anymore :-)
Regards
Balagopal
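A rough sketch of the VM-based NFS re-export workaround Balagopal describes, where the file system name, MGS NID, mount point and export options are placeholders rather than his actual configuration:

  # on the NFS-gateway VM: mount Lustre as an ordinary client
  mount -t lustre 192.168.1.1@tcp0:/testfs /mnt/lustre

  # /etc/exports - an explicit fsid is needed because there is no local block device
  /mnt/lustre  *(rw,async,no_root_squash,fsid=1)

  # publish the export and start the NFS server
  exportfs -ra
  service nfs start

Compute nodes then mount the share from the VM, so rebooting the gateway to clear the NFS-export bug does not affect running compute jobs, as described above.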
> Hi,
>
> we would like to establish a small Lustre instance and for the OST
> planning to use standard Dell PE1950 servers (2x QuadCore + 16 GB Ram) and
> for the disk a JBOD (MD1000) steered by the PE1950 internal Raid controller
> (Raid-6). Any experience (good or bad) with such a config ?
>
> thanxs,
> Martin
>
Hi Martin,

On Wednesday 26 March 2008 04:53:31 Martin Gasthuber wrote:
> we would like to establish a small Lustre instance and for the OST
> planning to use standard Dell PE1950 servers (2x QuadCore + 16 GB Ram)
> and for the disk a JBOD (MD1000) steered by the PE1950 internal Raid
> controller (Raid-6). Any experience (good or bad) with such a config?

I also have a 50 TB Lustre setup based on this hardware: 8 PE1950 OSSes, each connected to two MD1000 OSTs. The MDS uses an MD3000 as the MDT for high availability (the redundancy is not currently in use, though; I never managed to get it working reliably).

I can't say much about the PERC6 controller, since I'm using its older brother, the PERC5, but memory-wise you should be good with 16 GB. We planned 4 GB per OSS (2 OSTs each) at the beginning, but we had to double that to avoid memory exhaustion [1]. It will depend on the load induced by the clients, though.

The MD1000s' performance is great as long as you set the read-ahead settings as Aaron mentioned.

/scratch $ iozone -c -c -R -b ~/iozone.xls -C -r 64k -s 24m -i 0 -i 1 -i 2 -i8 -t50

Throughput report (record size = 64 KB, output in KB/sec):

  Initial write     1317906.72
  Rewrite           2423618.81
  Read              3484409.47
  Re-read           4023550.60
  Random read       3361937.08
  Mixed workload    2994666.57
  Random write      1777569.04

[1] http://lists.lustre.org/pipermail/lustre-discuss/2008-February/004874.html

Cheers,
--
Kilian