John Ouellette
2009-Apr-08 04:54 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
Hi -- I work for an astronomical data archive centre which stores about 300TB of data. We are considering options for replacing our current data management system, and one proposed alternative uses Lustre as one component. From what I have read about Lustre, it is largely targeted at HPC installations rather than data centres, so I'm a bit worried about its applicability here.

Although we do have a requirement for high throughput (although not to really support 1000s of clients: more likely a few dozen nodes), our primary concerns are reliability and data integrity. From reading the docs, it looks as though Lustre can be made to be very reliable, but how about data integrity? Are there problems with data corruption? If the MDS data is lost, is it possible to rebuild it, given only the file-system data? How easy is it to back up Lustre? Do you back up the MDS data and the OST data, or do you back up through a Lustre client?

Thanks in advance for any answers or pointers,
John Ouellette
Aaron Porter
2009-Apr-08 17:11 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
On Tue, Apr 7, 2009 at 9:54 PM, John Ouellette <john.ouellette at nrc-cnrc.gc.ca> wrote:
> Although we do have a requirement for high throughput (although not to
> really support 1000s of clients: more likely a few dozen nodes), our
> primary concerns are reliability and data integrity. From reading the
> docs, it looks as though Lustre can be made to be very reliable, but how
> about data integrity?

Also -- are there configurations that can provide data availability across single or multiple OST failures?
Kevin Fox
2009-Apr-08 20:26 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
We currently use Lustre for an archive data cluster.

  df -h
  Filesystem              Size  Used Avail Use% Mounted on
  n15:/nwfsv2-mds1/client
                          1.2P  306T  789T  28% /nwfs

To deal with some of the archive issues (non-HPC), we run the cluster a little differently than the norm. We have set the default stripe to 1 wide. Our OSS's are white box, 24 1TB drives hanging off of a couple of 3ware controllers set up in RAID 6's. This is much cheaper than the redundant fibre channel setups that you usually see in HPC Lustres.

Because of the hardware listed above, Lustre backups/restores are a real pain. A normal backup would take forever on a Lustre of this size. We have implemented our own backup system to deal with it. It involves a modified e2scan utility and a FUSE filesystem. If I can find some time, I plan on trying to release this under the GPL some day.

One of our main requirements was to be able to restore an OSS/OST as quickly as possible if there was a failure. We separate and colocate each OST's data on tape to allow for quick restores. We have had a few OSS failures in the years of running the system and have been able to quickly restore just that OSS's data each time. Without this type of system, the tape drives would have to read just about all of the data stored on tape to get at the relevant bits. Since we have 208 OSTs, restoring an OSS with this method gains us something like 52 times the performance.

To deal with an OSS loss, we configure that OSS out. The file system continues as normal, with any access to files residing on an affected OST throwing IO errors. In the meantime we can restore the OST's data. The stripe count was set to 1 so that files do not cross OSS's. That way, if an OSS/OST is lost it doesn't affect as many files. Having the stripe count set to 1 is also assumed by the backup subsystem; it allows for the colocation I described above. My plan is to enhance the filesystem to handle stripe > 1 some day but I have not been able to free up the time to do so yet.

As for the MDS, I have code to try to back that up, but I haven't used it in production or tested a restore of the data. What we usually do is take a dump of the MDS during our down times, and we have used that and the backups to restore the data in case of an MDS failure. We put a lot more redundancy (RAID 1 over two RAID 6's on separate controllers) into our MDS than the rest of the system, so we haven't had as many problems with it as with the OSS's.

As far as data corruption goes, Lustre currently doesn't keep checksums, so it's really left up to the IO subsystem to handle that. Pick a good one. We have had problems with our 3ware controllers corrupting data at times, but so far we have been able to restore any affected data from backups. We have bumped into a few kernel/Lustre corruption bugs related to 32-bit boxes and 2+TB block devices a few times, but not in a while. We were able to restore data from backups to handle this too, but that story is a whole book unto itself.

So, is Lustre as an archive file system doable? Yes. Is it recommended? Depends how much effort you want to put into it.

Kevin
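[For reference, a rough sketch of the two admin operations Kevin describes -- forcing a stripe count of 1, and "configuring out" an OST whose OSS has failed -- assuming 1.6-era lfs/lctl syntax; the mount point and device number are illustrative, not taken from Kevin's cluster:]

  # Set the default stripe count to 1 on the filesystem root so each
  # new file lands entirely on a single OST
  # (option spelling varies with the Lustre version)
  lfs setstripe -c 1 /mnt/lustre

  # Verify the layout new files will inherit
  lfs getstripe /mnt/lustre

  # "Configure out" an OST whose OSS has failed: deactivate the matching
  # OSC on the MDS so no new objects are allocated there; reads of files
  # living on that OST return IO errors until it is restored
  lctl dl                       # list device numbers
  lctl --device 12 deactivate   # 12 = the OSC for the failed OST (illustrative)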
Kevin Fox
2009-Apr-08 20:31 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
A lot of sites pair OSS's with fibre channel. If an OSS fails, the OST it served is then served by its buddy.

Kevin

On Wed, 2009-04-08 at 10:11 -0700, Aaron Porter wrote:
> Also -- are there configurations that can provide data availability
> across single or multiple OST failures?
John Ouellette
2009-Apr-08 20:54 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
Hi Kevin -- your usage sounds similar to ours, and the challenges you've faced are likely similar to what we're looking at. I'd be interested in learning more about your architecture, and any recommendations that you have (i.e. what would you do differently).

One complication we have with back-ups is that (currently) our tape back-ups are done at a local Grid computing site, and the hardware is owned and maintained by them. We need to use TSM to do our back-ups: I'm not sure that the 'infinite incremental' scheme of TSM would work well with Lustre.

Our current home-grown data management system uses vanilla Linux boxes as storage nodes, and a database (on Sybase) to manage files and file metadata. To maintain file integrity, every file that is put into the system (using our API) is checksummed, and the on-disk files are compared to the metadata db by a continually cycling background task. Also, we pair up storage nodes so that each file automatically gets put onto two nodes. With the files on two identical nodes, we can take one down for maintenance while still having full access to the data, and can recover one node from its mirror. This mirroring is in addition to the off-site tape back-up.

This system is great in its simplicity (we can recover the entire file management system from the contents of the storage nodes' file-systems, although we've never had to), but it either needs to be largely refactored or replaced (hence the interest in things like Lustre). Lustre does not give file-management capabilities, so we were looking into using iRODS on top of Lustre.

I'm not sure what you mean by "we have set the default stripe to 1 wide". Does this affect how the blocks are written to disk? One problem I foresee with backing up the OSTs is that each OST (might) only hold a fraction of a file, and without the MDS data you don't know what part of what file.

In your architecture, can you take OSS's offline without losing data access? My suspicion is that we'd only get this if the OSTs of that host were also connected to another OSS.

Thx,
J.
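[The integrity scrubber John describes (checksum on ingest, then a continually cycling compare against the metadata database) can sit above whatever filesystem is underneath. A minimal sketch, assuming a flat "checksum  path" manifest exported from the database; the manifest path and hash choice are illustrative, not CADC's actual API:]

  # Re-checksum everything under the archive root and diff against the
  # manifest exported from the metadata database; mismatches or missing
  # files are candidates for restore from the mirror node or from tape.
  cd /archive
  find . -type f -print0 | xargs -0 sha256sum | sort -k2 > /tmp/ondisk.sum
  sort -k2 /var/lib/archive/manifest.sum > /tmp/expected.sum
  diff /tmp/expected.sum /tmp/ondisk.sum && echo "scrub clean"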
Aaron Porter
2009-Apr-08 21:00 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
On Wed, Apr 8, 2009 at 1:54 PM, John Ouellette <john.ouellette at nrc-cnrc.gc.ca> wrote:
> In your architecture, can you take OSS's offline without losing data
> access? My suspicion is that we'd only get this if the OSTs of that
> host were also connected to another OSS.

The Wiki seems to indicate that 1.8 will allow this, but then the same Wiki says 1.8 should have come out 6-8 months ago...
Kevin Fox
2009-Apr-08 21:40 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
It's been in Lustre for a while. We're using it in some of our Lustre 1.4 clusters.

Kevin

On Wed, 2009-04-08 at 14:00 -0700, Aaron Porter wrote:
> The Wiki seems to indicate that 1.8 will allow this, but then the same
> Wiki says 1.8 should have come out 6-8 months ago...
John Ouellette
2009-Apr-08 22:00 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
I'm afraid I'm still not up to speed with Lustre (still just reading docs). Do you mean that you can configure Lustre to have N+1 redundancy w.r.t. file data? I.e. if you have two independent OSSs, can you configure Lustre so that you can take one down and still have access to the data (with no direct hardware connections from the second OSS to the OSTs of the box that's being taken down)?

Thx,
John

--
Dr. John Ouellette
Operations Manager
Canadian Astronomy Data Centre
Herzberg Institute of Astrophysics
National Research Council Canada
5071 West Saanich Road, Victoria BC V9E 2E7 Canada
Phone: 250-363-3037
Kevin Fox
2009-Apr-08 22:09 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
On Wed, 2009-04-08 at 13:54 -0700, John Ouellette wrote:
> Hi Kevin -- your usage sounds similar to ours [ ... ] I'd be interested
> in learning more about your architecture, and any recommendations that
> you have (i.e. what would you do differently).

Yup. This sounds very similar.

> We need to use TSM to do our back-ups: I'm not sure that the 'infinite
> incremental' scheme of TSM would work well with Lustre.

Actually, we are using TSM with incremental backups on top of the FUSE file system. We basically provide a subdirectory in the root of the file system per OST, and then spawn off a backup run using TSM's virtual node name option for each OST on that subdirectory. We have 4 Dell 1950's, each running the backup file system, and we run 16 TSM instances at a time on them until all OSTs are backed up. It usually takes ~10-12 hours to complete a backup pass.

This architecture was built to accommodate our Lustre 1.4 system. Rumor has it that the later 1.6 releases can have a Lustre client and OSS on the same box; using the lbfs, you could then do backups from each OSS directly instead of through a set of backup nodes. I attempted this with early 1.4 releases but it wasn't supported back then. Weird (weird) stuff happened if you tried it. I've been meaning to try this again, since it takes no code changes to the lbfs, but I don't currently have a suitable 1.6 Lustre available.

> To maintain file integrity, every file that is put into the system
> (using our API) is checksummed, and the on-disk files are compared to
> the metadata db by a continually cycling background task. Also, we pair
> up storage nodes so that each file automatically gets put onto two
> nodes. [ ... ] This mirroring is in addition to the off-site tape
> back-up.

Lustre doesn't currently support RAID 1 striping. That would solve the problem of taking one OST down. I don't know where that is on the road map.

Mirroring like you're doing has the benefit of being able to take an OST down. The drawback is space cost. We're using RAID 6's and haven't had much data unavailability. We're using about 1/6 of our space for redundancy; you're using 1/2. I'm not sure, but I think it would probably be cheaper to just make the OSTs fibre channel attached and use RAID 6 with OSS failover pairs than to mirror everything.

Checksumming is on the roadmap, I think. If you striped 1 and gathered the metadata like I do with the e2scan patch, you could checksum the data directly on the OSS's. I've been meaning to write a system like this at some point (and actually have had to do it manually once, in a disaster) but haven't had the time yet.

As far as RAID 1 pairing of nodes goes, you might be able to hack something together using DRBD and OSS failover. No clue if it's been tried before.

> Lustre does not give file-management capabilities, so we were looking
> into using iRODS on top of Lustre.

I've been meaning to look more at iRODS, but haven't had the time. :) If you go down that route, please let me know how you like it.

> I'm not sure what you mean by "we have set the default stripe to 1
> wide". Does this affect how the blocks are written to disk?

Indirectly. Blocks are striped across OSTs in a RAID 0 manner. If you set the stripe count to 1, the whole file is written to only one OST. If you mount the underlying OST's file system and look at one of the files, you see exactly what you see catting the file from a Lustre client. This makes backups and reliability better, but at the cost of performance.

> One problem I foresee with backing up the OSTs is that each OST (might)
> only hold a fraction of a file, and without the MDS data you don't know
> what part of what file.

Yup. This is why we stripe 1 wide.

> In your architecture, can you take OSS's offline without losing data
> access? My suspicion is that we'd only get this if the OSTs of that
> host were also connected to another OSS.

Correct. We can't take an OSS down without the data being unavailable.

Kevin
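[A rough sketch of the per-OST TSM fan-out Kevin describes, assuming the FUSE backup filesystem exposes one directory per OST under a single mount point; the mount path and xargs throttling are illustrative, and the dsmc options should be checked against your TSM client version:]

  # One incremental dsmc run per OST directory, 16 at a time, each
  # registered in TSM under its own virtual node name so that an OST's
  # data stays colocated on tape and can be restored on its own.
  BACKUP_ROOT=/mnt/lbfs              # FUSE view: one subdirectory per OST
  ls "$BACKUP_ROOT" | xargs -P16 -I{} \
      dsmc incremental -virtualnodename={} "$BACKUP_ROOT/{}/"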
Kevin Fox
2009-Apr-08 22:25 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
No, I'm afraid not. What you can do is fibre-attach (or use some other shared protocol, say iSCSI) the storage to two different OSS's. When OSS1 fails, OSS2 can share out the OST until OSS1 comes back. If your storage is directly attached to only one OSS, there is currently no way to have that OST's data available when the OSS goes offline. (Well, other than trying the DRBD trick I mentioned in my other message. That's untested, YMMV, IANAL, etc, etc :)

Kevin
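[For the shared-storage arrangement Kevin describes, a minimal sketch of giving an OST on dual-attached storage a failover buddy, assuming Lustre 1.6-style mkfs.lustre/mount syntax; the NIDs, device and mount point are illustrative, on 1.4 the equivalent is done through lmc/lconf, and in practice a failover framework such as Heartbeat drives the takeover rather than an operator:]

  # On oss1 (primary): format the shared LUN as an OST and record
  # oss2 as its failover node
  mkfs.lustre --fsname=archfs --ost \
      --mgsnode=10.0.0.1@tcp0 --failnode=10.0.0.12@tcp0 /dev/mapper/ost0

  # Normal operation: oss1 mounts (i.e. serves) the OST
  mount -t lustre /dev/mapper/ost0 /mnt/ost0

  # If oss1 dies, oss2 mounts the same LUN and serves the OST;
  # clients reconnect and recover in-flight requests
  mount -t lustre /dev/mapper/ost0 /mnt/ost0    # run on oss2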
Adam Gandelman
2009-Apr-09 06:47 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
Kevin Fox wrote:
> As far as RAID 1 pairing of nodes goes, you might be able to hack
> something together using DRBD and OSS failover. No clue if it's been
> tried before.

I've just finished putting together a basic Lustre cluster in a lab using DRBD for redundancy and Heartbeat for failover on both MDS and OSS. Failover is working fine on both. We're using mostly commodity hardware. So far the setup seems like a solid candidate for a very cost-efficient way of bringing redundancy and high availability to Lustre. In the coming weeks we'll be growing the cluster as we do benchmarking, and hopefully we will have something to add to the DRBD section of the Lustre wiki (which is pretty thin ATM).

Adam Gandelman
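[A minimal sketch of the DRBD-mirrored OST arrangement Adam describes, assuming a DRBD resource (here called ost0, with illustrative device paths) is already defined in drbd.conf to mirror the OST block device between the two OSS nodes, the device was formatted with mkfs.lustre --ost, and 1.6-style target mounting is in use; in practice Heartbeat issues the failover steps rather than an operator:]

  # On both OSS nodes: attach and connect the mirrored resource
  drbdadm up ost0

  # On the currently active node: promote the replica and start the OST
  # by mounting the DRBD device as a Lustre target
  drbdadm primary ost0
  mount -t lustre /dev/drbd0 /mnt/lustre/ost0

  # On failover, the surviving peer promotes its replica and mounts it;
  # the OST comes back with an up-to-date copy of the data
  drbdadm primary ost0
  mount -t lustre /dev/drbd0 /mnt/lustre/ost0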
Heiko Schröter
2009-Apr-09 06:58 UTC
[Lustre-discuss] DRBD transfer rate (was: Lustre as a reliable f/s for an archive data centre)
On Thursday, 9 April 2009 08:47:00 Adam Gandelman wrote:
> I've just finished putting together a basic Lustre cluster in a lab
> using DRBD for redundancy and Heartbeat for failover on both MDS and
> OSS. Failover is working fine on both.

Did you have a chance to measure the data transfer rate inside DRBD? We observed that it can be pretty slow (5 MB/s), but that might be down to a misconfiguration and/or hardware issue ...

Heiko
Peter Kjellstrom
2009-Apr-09 07:24 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
On Wednesday 08 April 2009, Kevin Fox wrote:
> A lot of sites pair OSS's with fibre channel. If an OSS fails, the OST
> it served is then served by its buddy.

He did ask about "OST failure", not "OSS failure". The former is not something Lustre can do.

/Peter

On Wed, 2009-04-08 at 10:11 -0700, Aaron Porter wrote:
> Also -- are there configurations that can provide data availability
> across single or multiple OST failures?
Peter Grandi
2009-Apr-11 14:10 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
> Hi -- I work for an astronomical data archive centre which stores
> about 300TB of data. [ ... ] Although we do have a requirement for
> high throughput (although not to really support 1000s of clients:
> more likely a few dozen nodes),

That "few dozen nodes" determines a bit what kind of performance you need to achieve, and how, because Lustre has indeed huge performance but *in the aggregate*: that is, it can achieve 100GB/s of *aggregate* throughput on a 1Gb/s network by having 1,000 clients each transferring at 100MB/s to/from 1,000 servers.

> our primary concerns are reliability and data integrity.

Reliability and data integrity on a 300TB archive is an unsolved research problem, as long as you really mean them. There are a number of snake-oil salesmen who will promise them to you, though.

> From reading the docs, it looks as though Lustre can be made to be
> very reliable,

Reading the docs? They are *very clear* that there is no way to make Lustre *as such* very reliable (until replication is implemented by Lustre itself). Lustre currently has the reliability of 1/Nth (where N is the striping) of the underlying storage system, with the (un)reliability of its own software layer reducing that further.

> but how about data integrity? Are there problems with data corruption?

Worrying about file-system-level handling of data corruption in an "astronomical data archive" may be the wrong approach. For data curation, *filesystem* (and even more so *storage*) integrity are convenience and performance issues, not a data integrity issue. Data integrity must be end-to-end. That is, you cannot use a file system of any sort as a data archive. A data archive is a rather different thing from a file system, even if it resides in a file system.

> If the MDS data is lost, is it possible to rebuild this, given only
> the file-system data?

No. Symbolic names are only on the MDS.

> How easy is it to back up Lustre?

It is extremely easy. Backing up 300TB of data is hard.

> Do you back up the MDS data and the OST data, or do you back up
> through a Lustre client?

Either way, but again, backing up 300TB of data is hard.

> Thanks in advance for any answers or pointers,

The above are the answers to the questions asked, but those questions seemed to be quite misguided, because what Lustre is and does is quite obvious, and some of the previous followups to your questions are a bit misguided or confused. I'll first try to explain clearly what Lustre is, and then reply to some similar but perhaps more appropriate questions.

Lustre is a data-parallel, directory-based, chunked, single-namespace, single-storage-pool network metafilesystem:

* Metafilesystem: it uses other file systems as storage devices, instead of using block devices.

* Network: Lustre is in essence a set of network protocols. There is also some data representation, but one can only use Lustre over a network. The network part of the Lustre implementation (LNET) is probably more important than the rest.

* Directory based: which metafilesystem file name corresponds to which base filesystem file(s) is kept in a separate directory.

* Chunked: the metafilesystem can be an aggregation of many independent but related base filesystems, the "chunks".

* Single namespace: the directory service maintains a single metafilesystem namespace over all the base filesystems.

* Single storage pool: the directory service maintains a single pool of available space over all the base filesystems. If a file is striped, it can be larger than any single base filesystem.
* Data-parallel: once the list of base filesystem files for a metafilesystem name has been obtained by a client from the directory server, and the base filesystem files have been "mounted", they can be accessed in parallel. The directory server cannot (yet).

In a simplified way, consider that two files 'a' and 'b' in directory 'd' are implemented thusly (if 'a' is not striped and 'b' is, and there are two data servers):

* lustre://dirserv/d/a with inum I1
  - lustrei://dataserv1/I1

* lustre://dirserv/d/b with inum I2
  - lustrei://dataserv1/I2-1
  - lustrei://dataserv2/I2-2
  - lustrei://dataserv1/I2-3
  - lustrei://dataserv2/I2-4
  ...

That's basically all... :-)

The Lustre client software is in effect an extended 'autofs'/'amd', and treats the Lustre directory server as a kind of LDAP server containing a list of automount pairs; thus it creates a top-level mount point for each Lustre namespace, noting for each of those which directory server it corresponds to.

Following the example above, the Lustre client creates a top-level mount point such as '/mnt/l' with reference to 'lustre://dirserv/'; as processes access paths under the mount point it auto-"mount"s each of the underlying files, so that if a process accesses '/mnt/l/d/a' what happens is that 'lustrei://dataserv1/I1' is auto-"mounted" as '/mnt/l/d/a', and 'lustrei://dataserv[12]/I2-*' as '/mnt/l/d/b/I2-*' (and in the latter case it provides the illusion to processes that '/mnt/l/d/b' is a single file).

The good questions to ask here are:

* For a 300TB data archive, do we need a single namespace and a single storage pool? Well, it really depends. Probably not, but it is convenient.

* Is there anything better or more cost-effective than Lustre for a 300TB data archive? Likely not, unless you don't care about a single namespace or a single storage pool.

* How do you back up a 300TB archive? Well, that's a research question. I personally think that the only practical way is another 300TB archive. Other people think tape libraries can be used. Perhaps...

* What do you mean by "reliable"? It can be about loss of service or loss of data. Lustre service can be made fairly reliable by redundancy in the network and in the directory and data servers, within limits. Lustre data can be made quite reliable with redundancy in the storage subsystems of the directory and storage servers. There is unavoidable common-mode failure in the use of the same metafilesystem and filesystem code. In practice 'ext3'/'ext4' are pretty reliable and Lustre itself is pretty good too.

* What do you mean by "integrity"? It can be about detecting the loss of integrity or recovering from a loss of integrity. Detecting loss of integrity in the data must be done *at least* end-to-end. It can be done at lower levels too, but that is not sufficient. Restoring integrity can be done by detecting loss of integrity in the representation of the data (disk blocks, links, metadata, ...) and using redundancy *as a matter of convenience*. Lustre does a bit of detecting loss of integrity in the metadata, and for now no integrity restoration. In this it has overall the same integrity properties as the underlying filesystem and storage systems, minus the integrity issues of the directory system. In practice it is pretty good.
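[As a concrete, hypothetical illustration of the name-to-object mapping described above: this is roughly what querying a striped file's layout looks like from a client. The paths and numbers are made up and the exact output of lfs getstripe varies between Lustre versions:]

  # Which OST objects back /mnt/l/d/b?
  lfs getstripe /mnt/l/d/b
  # /mnt/l/d/b
  # lmm_stripe_count:  2
  # lmm_stripe_size:   1048576
  # lmm_stripe_offset: 0
  #       obdidx       objid        objid        group
  #            0       123456       0x1e240          0
  #            1       123457       0x1e241          0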
My impression is that even though the Lustre design is definitely targeted at coarse-grained data-parallel computation, not archival, it is handy to use it as a kind of single namespace and single storage pool anyhow, as there are quite few practical/low-cost alternatives and it is fairly easy to set up. But the really critical factors are the design of the data archive (the level above) and the storage/network system (the level below).
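[Finally, on the "how do you back up the MDS" question raised earlier in the thread (Kevin's periodic MDS dump, Peter's "it is extremely easy"): a minimal sketch of a device-level MDT backup along the lines of the procedure in the Lustre manual, with illustrative device and path names; the extended attributes must be saved explicitly because that is where the file-to-OST-object mapping lives:]

  # With the MDS stopped, mount the MDT device directly as ldiskfs
  # (ext3 on 1.4-era systems)
  mount -t ldiskfs /dev/mdt /mnt/mdt_backup
  cd /mnt/mdt_backup

  # Save the extended attributes (striping info), then the tree itself
  getfattr -R -d -m '.*' -e hex -P . > /root/mdt_ea.bak
  tar czf /root/mdt_backup.tgz --sparse .

  cd / && umount /mnt/mdt_backup
  # Restore is the reverse: untar onto a freshly formatted MDT, then
  # replay the attributes with: setfattr --restore=/root/mdt_ea.bak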