Hello All

I have a brand spanking new 40TB hardware RAID6 array to play around with. I am looking for recommendations for which filesystem to use. I am trying not to break this up into multiple filesystems, as we are going to use it for backups. Other factors are performance and reliability.

CentOS 5.6

The array is /dev/sdb.

Here is what I have tried so far:
reiserfs is limited to 16TB.
ext4 does not seem to be fully baked in 5.6 yet; parted 1.8 does not support creating ext4 (strange).

Anyone work with large filesystems like this who has any suggestions/recommendations?

--
Matthew Feinberg
matthew at choopa.com
AIM: matthewchoopa
On 04/12/11 12:23 AM, Matthew Feinberg wrote:
> Hello All
>
> I have a brand spanking new 40TB Hardware Raid6 array

Never mind filesystems... is that one RAID set? Do you have any idea how LONG rebuilding that is going to take when there are any drive hiccups? Or how painfully slow writes will be until it's rebuilt? Is that something like 22 x 2TB or 16 x 3TB? I'll bet a RAID rebuild takes nearly a WEEK, maybe even longer.

I am very strongly NOT in favor of RAID6, even for nearline bulk backup storage. I would sacrifice the space and format that as RAID10, and have at LEAST a couple of hot spares too.
On Tue, Apr 12, 2011 at 9:23 AM, Matthew Feinberg <matthew at choopa.com> wrote:
> Hello All
>
> I have a brand spanking new 40TB Hardware Raid6 array to play around
> with. I am looking for recommendations for which filesystem to use. I am
> trying not to break this up into multiple file systems as we are going
> to use it for backups. Other factors is performance and reliability.

We've been very happy with XFS, as it allows us to add disk space through LVM and grow the filesystem online. We've had to reboot the server when we add new disk enclosures, but that's not XFS's fault...

BR
Bent
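For the archives, the grow-online sequence is roughly the following (a hedged sketch, not our exact setup; the device /dev/sdc, volume group vg_backup, logical volume lv_backup, and mount point /srv/backup are placeholders):

  pvcreate /dev/sdc                          # new LUN from the added enclosure
  vgextend vg_backup /dev/sdc                # add it to the existing volume group
  lvextend -L +2T /dev/vg_backup/lv_backup   # grow the logical volume
  xfs_growfs /srv/backup                     # XFS grows while mounted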
Le 12/04/2011 09:23, Matthew Feinberg a écrit :
> I have a brand spanking new 40TB Hardware Raid6 array to play around
> with. I am looking for recommendations for which filesystem to use.
> [...]
> Anyone work with large filesystems like this that have any
> suggestions/recommendations?

Hi Matthew,

I would go for xfs, which is now supported in CentOS. This is what I use for a 16 TB storage array, with CentOS 5.3 (Rocks Cluster), and it works fine. No problem with lengthy fsck, as with ext3 (which does not support such capacities anyway). I have not yet tried ext4...

Alain

--
=========================================================
Alain Péan - LPP/CNRS
Administrateur Système/Réseau
Laboratoire de Physique des Plasmas - UMR 7648
Observatoire de Saint-Maur
4, av de Neptune, Bat. A
94100 Saint-Maur des Fossés
Tel : 01-45-11-42-39 - Fax : 01-48-89-44-33
==========================================================
Torres, Giovanni (NIH/NINDS) [C]
2011-Apr-12 12:34 UTC
[CentOS] 40TB File System Recommendations
On Apr 12, 2011, at 3:23 AM, Matthew Feinberg wrote:
> ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not
> support creating ext4 (strange)

The CentOS homepage states that ext4 is now a fully supported filesystem in 5.6.
On Tuesday, April 12, 2011 02:51:45 PM John R Pierce wrote:
> On 04/12/11 6:02 AM, Marian Marinov wrote:
> > Yes... but with such RAID10 solution you get only half of the disk space... so
> > from 10 2TB drives you get only 10TB instead of 16TB with RAID6.
>
> those disks are $100 each. whats your data worth?

Where can I get an enterprise-class 2TB drive for $100? Commodity SATA isn't enterprise-class. SAS is; FC is; SCSI is. A 500GB FC drive with EMC firmware, new, is going to set you back ten times that, at least. What's your data worth indeed, putting it on commodity disk.... :-)

> in this case, the OP is talking about a 40TB array, so thats a TWENTY
> TWO drive raid. NOONE I know in the storage business will use larger
> than a 8 or 10 drive raid set.

EMC allows RAID groups up to 16 drives on Clariion storage. I've been doing this with EMC stuff for a while, with RAID6 plus a hotspare per DAE; that's a 14-drive RAID group plus the hotspare on one DAE. On some systems I forgo the dedicated per-DAE hotspare and spread a 16-drive RAID6 group and a 14-drive RAID6 group across two DAEs, with hotspares on other DAEs. Works OK, and I've had double drive soft failures on a single RAID6 group that successfully hotspared (and back). This is partially due to the custom EMC firmware on the drives, and the interaction with the storage processor. Rebuild time is several hours, but with more, smaller drives it's not too bad.

> If you really need such a massive
> volume, you stripe several smaller raidsets, so the raid6 version would
> be 2 x 12 x 2TB or 24 drives for raid6+0 == 40TB.

Or you do metaLUNs, or similar using LVM.
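The LVM flavor of a metaLUN is essentially a striped logical volume across two (or more) smaller RAID6 LUNs; a rough sketch, with invented device and volume names:

  pvcreate /dev/sdb /dev/sdc
  vgcreate vg_bulk /dev/sdb /dev/sdc
  # -i 2 stripes across both PVs, -I 256 sets a 256KB stripe element
  lvcreate -i 2 -I 256 -l 100%FREE -n lv_bulk vg_bulk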
Thank you everyone for the advice and great information. From what I am gathering, XFS is the way to go.

A couple more questions.

What partitioning utility is suggested? parted and fdisk do not seem to be doing the job.

RAID level: I am considering moving away from RAID6 due to possible write performance issues. The array is 22 disks. I am not opposed to going with RAID10, but I am looking for a good balance of performance/capacity.

Hardware or software RAID: is there an advantage either way on such a large array?

--
Matthew Feinberg
matthew at choopa.com
AIM: matthewchoopa
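On the partitioning question: fdisk only writes MSDOS/MBR labels, which top out at 2TB, so an array this size needs a GPT label (or no partition table at all, with the filesystem made directly on the device). A hedged, untested sketch with parted, assuming the array really is /dev/sdb and xfsprogs is installed:

  parted /dev/sdb mklabel gpt
  parted /dev/sdb mkpart primary 0% 100%   # or give explicit start/end sizes if this parted rejects percentages
  mkfs.xfs /dev/sdb1
  # alternatively, skip partitioning entirely:
  # mkfs.xfs /dev/sdb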
On 4/13/11, Rudi Ahlers <Rudi at softdux.com> wrote:
> I haven't had problems doing it this way yet.

Thanks for the confirmation. Could you please outline the general steps to expand an existing RAID 10 with another RAID 1 device? I'm trying to test this out, but unfortunately, being the noob that I am, all I have managed so far is a couple of /dev/loop RAID 1 arrays that can be neither deleted nor combined into a RAID 0 array.
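For the loop-device experiment specifically, the sequence is roughly the following (an untested sketch; the md device numbers are arbitrary). Build two RAID1 pairs, stripe them with RAID0, and on teardown stop the arrays top-down and zero the superblocks before reusing the loop devices:

  mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/loop0 /dev/loop1
  mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/loop2 /dev/loop3
  mdadm --create /dev/md20 --level=0 --raid-devices=2 /dev/md10 /dev/md11
  # teardown: stop the stripe first, then the mirrors, then wipe md metadata
  mdadm --stop /dev/md20
  mdadm --stop /dev/md10
  mdadm --stop /dev/md11
  mdadm --zero-superblock /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3

Growing the RAID0 layer later is the awkward part on an el5 kernel; the usual dodge is to put LVM on top and vgextend with each new RAID1 pair instead of reshaping the RAID0.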
On Tuesday, April 12, 2011 06:49:08 PM Drew wrote:
> > Where can I get an enterprise-class 2TB drive for $100? Commodity SATA isn't enterprise-class.
>
> I can get Seagate's Constellation ES series SATA drives in 1TB for
> $125. 2TB will run me around $225.

Yeah, those are reasonable near-line drives for archival storage, or when you have a very small number of servers accessing the storage, and large amounts of cache.

EMC used Barracuda ES SATA drives in their Clariion CX3 boxes for a while; used a dual-attach 4G FC bridge controller to go from the DAE backplane to the SATA port, and emulated the dual-attach functionality of FC with it. I'm not 100% sure, but I think the SATA drive itself got EMC-specific firmware.
On Tuesday, April 12, 2011 07:00:26 PM compdoc wrote:
> I've had good luck with green, 5400 rpm Samsung drives. They don't spin down
> automatically and work fine in my raid 5 arrays. The cost is about $80 for
> 2TB drives.

And that's a good price point for a commodity drive; not something I would count on for long-term use, but still a good price point.

> I also have a few 5900 rpm Seagate ST32000542AS drives, but not currently in
> raids. They don't spin down, so I'm sure they would be fine in a raid.

The biggest issue isn't the spindown. Google 'WDTLER' and see the other, bigger, issue. In a nutshell, TLER (Time-Limited Error Recovery; see https://secure.wikimedia.org/wikipedia/en/wiki/TLER ) keeps the drive from spending too long trying to recover soft errors; without it, the long error recovery time can cause the drive to drop out of RAID sets and be marked as faulted.

> Just because they are so tiny on the outside, 2.5 inch drives like the
> Seagate Constellation and WD Raptors are great. Unfortunately, the don't
> come any larger than 1TB, so I use them in special situations.

FWIW, EMC's new VNX storage systems are at the 2.5 inch form factor, with SSD and mechanical platter drives as options, using 6G SAS interfaces.
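If the drive and the smartmontools build are recent enough, that recovery timeout can be inspected (and on some drives set) from Linux via SCT Error Recovery Control; a hedged sketch, since many desktop drives simply refuse the command and older smartctl releases don't support it:

  smartctl -l scterc /dev/sda          # show current SCT Error Recovery Control values
  smartctl -l scterc,70,70 /dev/sda    # limit read/write recovery to 7.0s (units are tenths of a second)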
On Thursday, April 14, 2011 10:37:15 AM Christopher Chan wrote:
> I used XFS extensively when I was running mail server farms for
> the mail queue filesystem and I only remember one or two incidents when
> the filesystem was marked read-only for no reason (seemingly - never had
> the time to find out why) but a reboot fixed those.

I've had that happen, recently, with ext3 on CentOS 4. FWIW.
On Thursday, April 14, 2011 11:20:23 AM Les Mikesell wrote:
> Same here, CentOS5 and ext3. Rare and random across identical hardware.
> So far I've blamed the hardware.

I don't have that luxury. This is one VM on a VMware ESX 3.5U5 host, and the storage is EMC Clariion fibre-channel, with VMware VMFS3 in between. The same storage RAID groups serve other VMs that haven't shown the problem. It happened regardless of the ESX host the guest was running on; I even svmotioned the vmx/vmdk over to a different RAID group, and after roughly two weeks it did it again.

I haven't had the issue since the 4.9 update and transitioning from the all-in-one vmware-tools package to the OSP stuff at packages.vmware.com (did that for a different reason, the 'can't reboot/restart vmxnet if IPv6 enabled' issue on ESX 3.5). Only the one VM guest had the problem, out of several C4 VMs; this one has the Scalix mailstore on it.

Reboot into single user, disable the journal, fsck, re-enable the journal, and things are OK. The last time it happened I didn't disable the journal before the fsck/reboot, but I didn't suffer any data loss even then (though journal replay in the 'fs went read-only, journal stopped' case isn't something you want to rely on in the general case).
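In command form, that recovery dance is roughly the following (placeholder device, and only from single-user mode with the filesystem unmounted):

  tune2fs -O ^has_journal /dev/sdb1   # drop the ext3 journal (fs becomes plain ext2)
  e2fsck -f /dev/sdb1                 # full check with no journal replay
  tune2fs -j /dev/sdb1                # recreate the journal, back to ext3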
Lamar Owen
2011-Apr-14 15:55 UTC
[CentOS] Ext3 remount ro (was:Re: 40TB File System Recommendations)
On Thursday, April 14, 2011 11:41:07 AM Peter Kjellström wrote:
> The default behaviour for ext3 on CentOS-5 is to remount read-only, as a
> safety measure, when something goes wrong beneath it (see mount option
> "errors" in man mount). The root cause can be any of a long list of hardware
> or software (kernel) problems (typically not ext3's fault though).

The root cause made its appearance as clamd getting oom-killed. Eight hours of rampant oom-killer activity, and the fs goes bang. Plenty of memory allocated by the host; perhaps too much memory for the 32-bit guest. But, as I said, the combination of the 4.9 update and going with VMware's OSP setup from packages.vmware.com seems to have fixed the underlying issue.

We're looking at a whole e-mail system overhaul anyway; while Scalix the package is performing well for what we need it to do, Scalix the company has been incredibly slow on the next update. Looking to go to Zarafa on C6 x86_64, perhaps. MS Outlook public folder/shared calendar/shared contacts/group scheduling support is the number one criterion, and Exchange is not the answer. So an upgrade of the existing system isn't on the radar at the moment, but a full migration to something else is.
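For reference, the knobs Peter is describing look roughly like this (a sketch; the device, mount point, and fstab line are placeholders):

  tune2fs -l /dev/sdb1 | grep -i 'errors behavior'   # what the fs will do on errors
  tune2fs -e remount-ro /dev/sdb1                    # set the on-disk default
  # or per-mount in /etc/fstab:
  #   /dev/sdb1  /srv/mail  ext3  defaults,errors=remount-ro  1 2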
Lamar Owen wrote:
> On Thursday, April 14, 2011 10:37:15 AM Christopher Chan wrote:
>> I used XFS extensively when I was running mail server farms for the
>> mail queue filesystem and I only remember one or two incidents when
>> the filesystem was marked read-only for no reason (seemingly - never had
>> the time to find out why) but a reboot fixed those.
>
> I've had that happen, recently, with ext3 on CentOS 4.

I've had that happen, also. It usually indicates a drive dying.

mark
On Thursday, April 14, 2011 02:17:41 PM aurfalien at gmail.com wrote:
> However if you like XFS, I'll assume you liek IRIX so check the 5dwm
> project which is the IRIX desktop for Linux.

Cool. Now if they ported the Audio DAT ripping program for IRIX to Linux, I'd be able to get rid of my O2.....

(SGI got special DAT tape drive firmware made by Seagate that can read and write Audio DAT tapes in a particular Seagate/Archive Python DDS-1 drive; SGI also put the software to work with Audio DATs in IRIX. I use that program occasionally on my O2, and previously on my Indigo2/IMPACT, to 'rip' Audio DATs for my professional audio production side business.... I can also master to Audio DAT with the same program, making it quite nice indeed.)
One was 32 bit, the other 64 bit.

Christopher Chan <christopher.chan at bradbury.edu.hk> wrote:
> On Thursday, April 14, 2011 07:26 AM, John Jasen wrote:
>> On 04/12/2011 08:19 PM, Christopher Chan wrote:
>>> On Tuesday, April 12, 2011 10:36 PM, John Jasen wrote:
>>>> On 04/12/2011 10:21 AM, Boris Epstein wrote:
>>>>> On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan
>>>>> <alain.pean at lpp.polytechnique.fr> wrote:
>>>>
>>>> <snipped: two recommendations for XFS>
>>>>
>>>> I would chime in with a dis-commendation for XFS. At my previous
>>>> employer, two cases involving XFS resulted in irrecoverable data
>>>> corruption. These were on RAID systems running from 4 to 20 TB.
>>>
>>> What were those circumstances? Crash? Power outage? What are the
>>> components of the RAID systems?
>>
>> One was a hardware raid over fibre channel, which silently corrupted
>> itself. System checked out fine, raid array checked out fine, xfs was
>> replaced with ext3, and the system ran without issue.
>>
>> Second was multiple hardware arrays over linux md raid0, also over fibre
>> channel. This was not so silent corruption, as in xfs would detect it
>> and lock the filesystem into read-only before it, pardon the pun, truly
>> fscked itself. Happened two or three times, before we gave up, split up
>> the raid, and went ext3. Again, no issues.
>
> 32-bit kernel by any chance?
On Apr 15, 2011, at 12:32 PM, Rudi Ahlers <Rudi at SoftDux.com> wrote:
> On Fri, Apr 15, 2011 at 6:26 PM, Ross Walker <rswwalker at gmail.com> wrote:
>> On Apr 15, 2011, at 9:17 AM, Rudi Ahlers <Rudi at SoftDux.com> wrote:
>>> On Fri, Apr 15, 2011 at 3:05 PM, Christopher Chan
>>> <christopher.chan at bradbury.edu.hk> wrote:
>>>> On Friday, April 15, 2011 07:24 PM, Benjamin Franz wrote:
>>>>> On 04/14/2011 09:00 PM, Christopher Chan wrote:
>>>>>> Wanna try that again with 64MB of cache only and tell us whether there
>>>>>> is a difference in performance?
>>>>>>
>>>>>> There is a reason why 3ware 85xx cards were complete rubbish when used
>>>>>> for raid5 and which led to the 95xx/96xx series.
>>>>>
>>>>> I don't happen to have any systems I can test with the 1.5TB drives
>>>>> without controller cache right now, but I have a system with some old
>>>>> 500GB drives (which are about half as fast as the 1.5TB drives in
>>>>> individual sustained I/O throughput) attached directly to onboard SATA
>>>>> ports in an 8 x RAID6 with *no* controller cache at all. The machine has
>>>>> 16GB of RAM and bonnie++ therefore used 32GB of data for the test.
>>>>>
>>>>> bonnie++ 1.96, machine pbox3, 32160M data set:
>>>>>   Sequential Output: per-chr 389 K/s (98% CPU), block 76709 K/s (22% CPU),
>>>>>     rewrite 91071 K/s (26% CPU); latencies 24190us / 1244ms / 1580ms
>>>>>   Sequential Input:  per-chr 2209 K/s (95% CPU), block 264892 K/s (26% CPU);
>>>>>     latencies 60411us / 69901us
>>>>>   Random Seeks:      590.5/s (11% CPU); latency 42586us
>>>>>   Sequential Create (16 files): create 10910/s (31% CPU), read/delete +++++;
>>>>>     Random Create: create 29293/s (80% CPU), read/delete +++++
>>>>>   Create latencies:  775us / 610us / 979us (seq), 740us / 370us / 380us (random)
>>>>>
>>>>> Given that the underlying drives are effectively something like half as
>>>>> fast as the drives in the other test, the results are quite comparable.
>>>>
>>>> Woohoo, next we will be seeing md raid6 also giving comparable results
>>>> if that is the case. I am not the only person on this list who thinks
>>>> cache is king for raid5/6 on hardware raid boards, and that using hardware
>>>> raid + bbu cache for better performance is one of the two reasons why we
>>>> don't do md raid5/6.
>>>>
>>>>> Cache doesn't make a lot of difference when you quickly write a lot more
>>>>> data than the cache can hold. The limiting factor becomes the slowest
>>>>> component - usually the drives themselves. Cache isn't magic performance
>>>>> pixie dust. It helps in certain use cases and is nearly irrelevant in
>>>>> others.
>>>>
>>>> Yeah, you are right - but cache is primarily to buffer the writes for
>>>> performance. Why else go through the expense of getting bbu cache? So
>>>> what happens when you tweak bonnie a bit?
>>>
>>> As a matter of interest, does anyone know how to use an SSD drive for
>>> cache purposes on Linux software RAID drives? ZFS has this feature and
>>> it makes a helluva difference to a storage server's performance.
>>
>> Put the file system's log device on it.
>>
>> -Ross
>
> Well, ZFS has a separate ZIL for that purpose, and the ZIL adds extra
> protection / redundancy to the whole pool.
>
> But the Cache / L2ARC drive caches all common reads & writes (simply put)
> onto SSD to improve overall system performance.
> So I was wondering if one could do this with mdraid or even just EXT3 / EXT4?

Ext3/4 and XFS allow specifying an external log device which, if it is an SSD, can speed up writes. All these filesystems aggressively use the page cache for read/write caching. The only thing you don't get is an L2ARC-type cache, but I have heard of a dm-cache project that might provide that type of cache.

-Ross
> As matter of interest, does anyone know how to use an SSD drive for cache
> purposes on Linux software RAID drives? ZFS has this feature and it
> makes a helluva difference to a storage server's performance.

You cannot. You can, however, use one for the external journal of ext3/4 in full journaling mode for something similar.
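A rough sketch of the external-journal setup being described (untested; /dev/sdc1 stands in for an SSD partition and /dev/sdb1 for the array, and the journal device's block size must match the filesystem's):

  # ext3/4: create a journal device on the SSD, then point the filesystem at it
  mke2fs -O journal_dev /dev/sdc1
  mkfs.ext3 -J device=/dev/sdc1 /dev/sdb1
  # mount with data=journal for the "full journaling mode" mentioned above
  mount -o data=journal /dev/sdb1 /mnt/backup

  # XFS equivalent: external log at mkfs time, and again at mount time
  mkfs.xfs -l logdev=/dev/sdc1,size=128m /dev/sdb1
  mount -o logdev=/dev/sdc1 /dev/sdb1 /mnt/backup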