John Ouellette
2009-Apr-08 04:54 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
Hi -- I work for an astronomical data archive centre which stores about 300TB of data. We are considering options for replacing our current data management system, and one proposed alternative uses Lustre as one component. From what I have read about Lustre, it is largely targeted at HPC installations rather than data centres, so I'm a bit worried about its applicability here.

Although we do have a requirement for high throughput (although not to really support 1000s of clients: more likely a few dozen nodes), our primary concerns are reliability and data integrity. From reading the docs, it looks as though Lustre can be made to be very reliable, but how about data integrity? Are there problems with data corruption? If the MDS data is lost, is it possible to rebuild it, given only the file-system data? How easy is it to back up Lustre? Do you back up the MDS data and the OST data, or do you back up through a Lustre client?

Thanks in advance for any answers or pointers,
John Ouellette
Aaron Porter
2009-Apr-08 17:11 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
On Tue, Apr 7, 2009 at 9:54 PM, John Ouellette <john.ouellette at nrc-cnrc.gc.ca> wrote:
> Although we do have a requirement for high throughput (although not to
> really support 1000s of clients: more likely a few dozen nodes), our
> primary concerns are reliability and data integrity. From reading the
> docs, it looks as though Lustre can be made to be very reliable, but how
> about data integrity?

Also -- are there configurations that can provide data availability across single or multiple OST failures?
Kevin Fox
2009-Apr-08 20:26 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
We currently use Lustre for an archive data cluster.

  df -h
  Filesystem              Size  Used Avail Use% Mounted on
  n15:/nwfsv2-mds1/client
                          1.2P  306T  789T  28% /nwfs

To deal with some of the archive issues (non-HPC), we run the cluster a little differently than the norm. We have set the default stripe to 1 wide. Our OSS's are white box, 24 1TB drives hanging off of a couple of 3ware controllers set up in RAID 6's. This is much cheaper than the redundant fibre channel setups that you usually see in HPC Lustres.

Because of the hardware listed above, Lustre backups/restores are a real pain. A normal backup would take forever on a Lustre of this size. We have implemented our own backup system to deal with it. It involves a modified e2scan utility and a FUSE filesystem. If I can find some time, I plan on trying to release this under the GPL some day.

One of our main requirements was to be able to restore an OSS/OST as quickly as possible if there was a failure. We separate and colocate each OST's data on tape to allow for quick restores. We have had a few OSS failures in the years of running the system and have been able to quickly restore just that OSS's data each time. Without this type of system, the tape drives would have to read just about all of the data stored on tape to get at the relevant bits. Since we have 208 OSTs, restoring an OSS with this method gains us something like 52 times the performance.

To deal with an OSS loss, we configure that OSS out. The file system continues as normal, with any access to files residing on an affected OST throwing IO errors. In the meantime we can restore the OST's data. The stripe count was set to 1 so that files do not cross OSS's. That way, if an OSS/OST is lost it doesn't affect as many files. Having the stripe count set to 1 is also assumed by the backup subsystem; it allows for the colocation I described above. My plan is to enhance the filesystem to handle stripe > 1 some day but I have not been able to free up the time to do so yet.

As for the MDS, I have code to try to back that up, but I haven't used it in production or tested a restore of the data. What we usually do is take a dump of the MDS during our down times, and we have used that and the backups to restore the data in case of an MDS failure. We put a lot more redundancy (RAID 1 over two RAID 6's on separate controllers) into our MDS than the rest of the system, so we haven't had as many problems with it as with the OSS's.

As far as data corruption goes, Lustre currently doesn't keep checksums, so it's really left up to the IO subsystem to handle that. Pick a good one. We have had problems with our 3ware controllers corrupting data at times, but so far we have been able to restore any affected data from backups. We have bumped into a few kernel/Lustre corruption bugs related to 32-bit boxes and 2+TB block devices a few times, but not in a while. We were able to restore data from backups to handle this too, but that story is a whole book unto itself.

So, is Lustre as an archive file system doable? Yes. Is it recommended? Depends how much effort you want to put into it.

Kevin
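[For reference, a rough sketch of the two admin operations Kevin describes -- forcing a stripe count of 1, and "configuring out" an OST whose OSS has failed -- assuming 1.6-era lfs/lctl syntax; the mount point and device number are illustrative, not taken from Kevin's cluster:]

  # Set the default stripe count to 1 on the filesystem root so each
  # new file lands entirely on a single OST
  # (option spelling varies with the Lustre version)
  lfs setstripe -c 1 /mnt/lustre

  # Verify the layout new files will inherit
  lfs getstripe /mnt/lustre

  # "Configure out" an OST whose OSS has failed: deactivate the matching
  # OSC on the MDS so no new objects are allocated there; reads of files
  # living on that OST return IO errors until it is restored
  lctl dl                       # list device numbers
  lctl --device 12 deactivate   # 12 = the OSC for the failed OST (illustrative)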
Kevin Fox
2009-Apr-08 20:31 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
A lot of sites pair OSS's with fibre channel. If an OSS fails, the OST it served is then served by its buddy.

Kevin

On Wed, 2009-04-08 at 10:11 -0700, Aaron Porter wrote:
> Also -- are there configurations that can provide data availability
> across single or multiple OST failures?
John Ouellette
2009-Apr-08 20:54 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
Hi Kevin -- your usage sounds similar to ours, and the challenges you've faced are likely similar to what we're looking at. I'd be interested in learning more about your architecture, and any recommendations that you have (i.e. what would you do differently).

One complication we have with back-ups is that (currently) our tape back-ups are done at a local Grid computing site, and the hardware is owned and maintained by them. We need to use TSM to do our back-ups: I'm not sure that the 'infinite incremental' scheme of TSM would work well with Lustre.

Our current home-grown data management system uses vanilla Linux boxes as storage nodes, and a database (on Sybase) to manage files and file metadata. To maintain file integrity, every file that is put into the system (using our API) is checksummed, and the on-disk files are compared to the metadata db by a continually cycling background task. Also, we pair up storage nodes so that each file automatically gets put onto two nodes. With the files on two identical nodes, we can take one down for maintenance while still having full access to the data, and can recover one node from its mirror. This mirroring is in addition to the off-site tape back-up.

This system is great in its simplicity (we can recover the entire file management system from the contents of the storage nodes' file-systems, although we've never had to), but it either needs to be largely refactored or replaced (hence the interest in things like Lustre). Lustre does not give file-management capabilities, so we were looking into using iRODS on top of Lustre.

I'm not sure what you mean by "we have set the default stripe to 1 wide". Does this affect how the blocks are written to disk? One problem I foresee with backing up the OSTs is that each OST (might) only hold a fraction of a file, and without the MDS data you don't know what part of what file.

In your architecture, can you take OSS's offline without losing data access? My suspicion is that we'd only get this if the OSTs of that host were also connected to another OSS.

Thx,
J.
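[The integrity scrubber John describes (checksum on ingest, then a continually cycling compare against the metadata database) can sit above whatever filesystem is underneath. A minimal sketch, assuming a flat "checksum  path" manifest exported from the database; the manifest path and hash choice are illustrative, not CADC's actual API:]

  # Re-checksum everything under the archive root and diff against the
  # manifest exported from the metadata database; mismatches or missing
  # files are candidates for restore from the mirror node or from tape.
  cd /archive
  find . -type f -print0 | xargs -0 sha256sum | sort -k2 > /tmp/ondisk.sum
  sort -k2 /var/lib/archive/manifest.sum > /tmp/expected.sum
  diff /tmp/expected.sum /tmp/ondisk.sum && echo "scrub clean"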
Aaron Porter
2009-Apr-08 21:00 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
On Wed, Apr 8, 2009 at 1:54 PM, John Ouellette <john.ouellette at nrc-cnrc.gc.ca> wrote:
> In your architecture, can you take OSS's offline without losing data
> access? My suspicion is that we'd only get this if the OSTs of that
> host were also connected to another OSS.

The Wiki seems to indicate that 1.8 will allow this, but then the same Wiki says 1.8 should have come out 6-8 months ago...
Kevin Fox
2009-Apr-08 21:40 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
It's been in Lustre for a while. We're using it in some of our Lustre 1.4 clusters.

Kevin

On Wed, 2009-04-08 at 14:00 -0700, Aaron Porter wrote:
> The Wiki seems to indicate that 1.8 will allow this, but then the same
> Wiki says 1.8 should have come out 6-8 months ago...
John Ouellette
2009-Apr-08 22:00 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
I'm afraid I'm still not up to speed with Lustre (still just reading docs). Do you mean that you can configure Lustre to have N+1 redundancy w.r.t. file data? I.e. if you have two independent OSSs, can you configure Lustre so that you can take one down and still have access to the data (with no direct hardware connections from the second OSS to the OSTs of the box that's being taken down)?

Thx,
John

--
Dr. John Ouellette
Operations Manager
Canadian Astronomy Data Centre
Herzberg Institute of Astrophysics
National Research Council Canada
5071 West Saanich Road, Victoria BC V9E 2E7 Canada
Phone: 250-363-3037
Kevin Fox
2009-Apr-08 22:09 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
On Wed, 2009-04-08 at 13:54 -0700, John Ouellette wrote:
> Hi Kevin -- your usage sounds similar to ours [ ... ] I'd be interested
> in learning more about your architecture, and any recommendations that
> you have (i.e. what would you do differently).

Yup. This sounds very similar.

> We need to use TSM to do our back-ups: I'm not sure that the 'infinite
> incremental' scheme of TSM would work well with Lustre.

Actually, we are using TSM with incremental backups on top of the FUSE file system. We basically provide a subdirectory in the root of the file system per OST, and then spawn off a backup run using TSM's virtual node name option for each OST on that subdirectory. We have 4 Dell 1950's, each running the backup file system, and we run 16 TSM instances at a time on them until all OSTs are backed up. It usually takes ~10-12 hours to complete a backup pass.

This architecture was built to accommodate our Lustre 1.4 system. Rumor has it that the later 1.6 releases can have a Lustre client and OSS on the same box; using the lbfs, you could then do backups from each OSS directly instead of through a set of backup nodes. I attempted this with early 1.4 releases but it wasn't supported back then. Weird (weird) stuff happened if you tried it. I've been meaning to try this again, since it takes no code changes to the lbfs, but I don't currently have a suitable 1.6 Lustre available.

> To maintain file integrity, every file that is put into the system
> (using our API) is checksummed, and the on-disk files are compared to
> the metadata db by a continually cycling background task. Also, we pair
> up storage nodes so that each file automatically gets put onto two
> nodes. [ ... ] This mirroring is in addition to the off-site tape
> back-up.

Lustre doesn't currently support RAID 1 striping. That would solve the problem of taking one OST down. I don't know where that is on the road map.

Mirroring like you're doing has the benefit of being able to take an OST down. The drawback is space cost. We're using RAID 6's and haven't had much data unavailability. We're using about 1/6 of our space for redundancy; you're using 1/2. I'm not sure, but I think it would probably be cheaper to just make the OSTs fibre channel attached and use RAID 6 with OSS failover pairs than to mirror everything.

Checksumming is on the roadmap, I think. If you striped 1 and gathered the metadata like I do with the e2scan patch, you could checksum the data directly on the OSS's. I've been meaning to write a system like this at some point (and actually have had to do it manually once, in a disaster) but haven't had the time yet.

As far as RAID 1 pairing of nodes goes, you might be able to hack something together using DRBD and OSS failover. No clue if it's been tried before.

> Lustre does not give file-management capabilities, so we were looking
> into using iRODS on top of Lustre.

I've been meaning to look more at iRODS, but haven't had the time. :) If you go down that route, please let me know how you like it.

> I'm not sure what you mean by "we have set the default stripe to 1
> wide". Does this affect how the blocks are written to disk?

Indirectly. Blocks are striped across OSTs in a RAID 0 manner. If you set the stripe count to 1, the whole file is written to only one OST. If you mount the underlying OST's file system and look at one of the files, you see exactly what you see catting the file from a Lustre client. This makes backups and reliability better, but at the cost of performance.

> One problem I foresee with backing up the OSTs is that each OST (might)
> only hold a fraction of a file, and without the MDS data you don't know
> what part of what file.

Yup. This is why we stripe 1 wide.

> In your architecture, can you take OSS's offline without losing data
> access? My suspicion is that we'd only get this if the OSTs of that
> host were also connected to another OSS.

Correct. We can't take an OSS down without the data being unavailable.

Kevin
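[A rough sketch of the per-OST TSM fan-out Kevin describes, assuming the FUSE backup filesystem exposes one directory per OST under a single mount point; the mount path and xargs throttling are illustrative, and the dsmc options should be checked against your TSM client version:]

  # One incremental dsmc run per OST directory, 16 at a time, each
  # registered in TSM under its own virtual node name so that an OST's
  # data stays colocated on tape and can be restored on its own.
  BACKUP_ROOT=/mnt/lbfs              # FUSE view: one subdirectory per OST
  ls "$BACKUP_ROOT" | xargs -P16 -I{} \
      dsmc incremental -virtualnodename={} "$BACKUP_ROOT/{}/"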
Kevin Fox
2009-Apr-08 22:25 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
No, I'm afraid not. What you can do is fibre-attach (or use some other shared protocol, say iSCSI) the storage to two different OSS's. When OSS1 fails, OSS2 can share out the OST until OSS1 comes back. If your storage is directly attached to only one OSS, there is currently no way to have that OST's data available when the OSS goes offline. (Well, other than trying the DRBD trick I mentioned in my other message. That's untested, YMMV, IANAL, etc, etc :)

Kevin
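[For the shared-storage arrangement Kevin describes, a minimal sketch of giving an OST on dual-attached storage a failover buddy, assuming Lustre 1.6-style mkfs.lustre/mount syntax; the NIDs, device and mount point are illustrative, on 1.4 the equivalent is done through lmc/lconf, and in practice a failover framework such as Heartbeat drives the takeover rather than an operator:]

  # On oss1 (primary): format the shared LUN as an OST and record
  # oss2 as its failover node
  mkfs.lustre --fsname=archfs --ost \
      --mgsnode=10.0.0.1@tcp0 --failnode=10.0.0.12@tcp0 /dev/mapper/ost0

  # Normal operation: oss1 mounts (i.e. serves) the OST
  mount -t lustre /dev/mapper/ost0 /mnt/ost0

  # If oss1 dies, oss2 mounts the same LUN and serves the OST;
  # clients reconnect and recover in-flight requests
  mount -t lustre /dev/mapper/ost0 /mnt/ost0    # run on oss2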
Adam Gandelman
2009-Apr-09 06:47 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
Kevin Fox wrote:
> As far as RAID 1 pairing of nodes goes, you might be able to hack
> something together using DRBD and OSS failover. No clue if it's been
> tried before.

I've just finished putting together a basic Lustre cluster in a lab using DRBD for redundancy and Heartbeat for failover on both MDS and OSS. Failover is working fine on both. We're using mostly commodity hardware. So far the setup seems like a solid candidate for a very cost-efficient way of bringing redundancy and high availability to Lustre. In the coming weeks we'll be growing the cluster as we do benchmarking, and hopefully we will have something to add to the DRBD section of the Lustre wiki (which is pretty thin ATM).

Adam Gandelman
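[A minimal sketch of the DRBD-mirrored OST arrangement Adam describes, assuming a DRBD resource (here called ost0, with illustrative device paths) is already defined in drbd.conf to mirror the OST block device between the two OSS nodes, the device was formatted with mkfs.lustre --ost, and 1.6-style target mounting is in use; in practice Heartbeat issues the failover steps rather than an operator:]

  # On both OSS nodes: attach and connect the mirrored resource
  drbdadm up ost0

  # On the currently active node: promote the replica and start the OST
  # by mounting the DRBD device as a Lustre target
  drbdadm primary ost0
  mount -t lustre /dev/drbd0 /mnt/lustre/ost0

  # On failover, the surviving peer promotes its replica and mounts it;
  # the OST comes back with an up-to-date copy of the data
  drbdadm primary ost0
  mount -t lustre /dev/drbd0 /mnt/lustre/ost0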
Heiko Schröter
2009-Apr-09 06:58 UTC
[Lustre-discuss] DRBD transfer rate (was: Lustre as a reliable f/s for an archive data centre)
On Thursday, 9 April 2009 08:47:00 Adam Gandelman wrote:
> I've just finished putting together a basic Lustre cluster in a lab
> using DRBD for redundancy and Heartbeat for failover on both MDS and
> OSS. Failover is working fine on both.

Did you have a chance to measure the data transfer rate inside DRBD? We observed that it can be pretty slow (5 MB/s), but that might be down to a misconfiguration and/or hardware issue ...

Heiko
Peter Kjellstrom
2009-Apr-09 07:24 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
On Wednesday 08 April 2009, Kevin Fox wrote:
> A lot of sites pair OSS's with fibre channel. If an OSS fails, the OST
> it served is then served by its buddy.

He did ask about "OST failure", not "OSS failure". The former is not something Lustre can do.

/Peter

On Wed, 2009-04-08 at 10:11 -0700, Aaron Porter wrote:
> Also -- are there configurations that can provide data availability
> across single or multiple OST failures?
Peter Grandi
2009-Apr-11 14:10 UTC
[Lustre-discuss] Lustre as a reliable f/s for an archive data centre
> Hi -- I work for an astronomical data archive centre which stores
> about 300TB of data. [ ... ] Although we do have a requirement for
> high throughput (although not to really support 1000s of clients:
> more likely a few dozen nodes),

That "few dozen nodes" determines a bit what kind of performance you need to achieve, and how, because Lustre has indeed huge performance but *in the aggregate*: that is, it can achieve 100GB/s of *aggregate* throughput on a 1Gb/s network by having 1,000 clients each transferring at 100MB/s to/from 1,000 servers.

> our primary concerns are reliability and data integrity.

Reliability and data integrity on a 300TB archive is an unsolved research problem, as long as you really mean them. There are a number of snake-oil salesmen who will promise them to you, though.

> From reading the docs, it looks as though Lustre can be made to be
> very reliable,

Reading the docs? They are *very clear* that there is no way to make Lustre *as such* very reliable (until replication is implemented by Lustre itself). Lustre currently has the reliability of 1/Nth (where N is the striping) of the underlying storage system, with the (un)reliability of its own software layer reducing that further.

> but how about data integrity? Are there problems with data corruption?

Worrying about file-system-level handling of data corruption in an "astronomical data archive" may be the wrong approach. For data curation, *filesystem* (and even more so *storage*) integrity are convenience and performance issues, not a data integrity issue. Data integrity must be end-to-end. That is, you cannot use a file system of any sort as a data archive. A data archive is a rather different thing from a file system, even if it resides in a file system.

> If the MDS data is lost, is it possible to rebuild this, given only
> the file-system data?

No. Symbolic names are only on the MDS.

> How easy is it to back up Lustre?

It is extremely easy. Backing up 300TB of data is hard.

> Do you back up the MDS data and the OST data, or do you back up
> through a Lustre client?

Either way, but again, backing up 300TB of data is hard.

> Thanks in advance for any answers or pointers,

The above are the answers to the questions asked, but those questions seemed to be quite misguided, because what Lustre is and does is quite obvious, and some of the previous followups to your questions are a bit misguided or confused. I'll first try to explain clearly what Lustre is, and then reply to some similar but perhaps more appropriate questions.

Lustre is a data-parallel, directory-based, chunked, single-namespace, single-storage-pool network metafilesystem:

* Metafilesystem: it uses other file systems as storage devices, instead of using block devices.

* Network: Lustre is in essence a set of network protocols. There is also some data representation, but one can only use Lustre over a network. The network part of the Lustre implementation (LNET) is probably more important than the rest.

* Directory based: which metafilesystem file name corresponds to which base filesystem file(s) is kept in a separate directory.

* Chunked: the metafilesystem can be an aggregation of many independent but related base filesystems, the "chunks".

* Single namespace: the directory service maintains a single metafilesystem namespace over all the base filesystems.

* Single storage pool: the directory service maintains a single pool of available space over all the base filesystems. If a file is striped, it can be larger than any single base filesystem.
* Data-parallel: once the list of base filesystem files for a metafilesystem name has been obtained by a client from the directory server, and the base filesystem files have been "mounted", they can be accessed in parallel. The directory server cannot (yet).

In a simplified way, consider that two files 'a' and 'b' in directory 'd' are implemented thusly (if 'a' is not striped and 'b' is, and there are two data servers):

* lustre://dirserv/d/a with inum I1
  - lustrei://dataserv1/I1

* lustre://dirserv/d/b with inum I2
  - lustrei://dataserv1/I2-1
  - lustrei://dataserv2/I2-2
  - lustrei://dataserv1/I2-3
  - lustrei://dataserv2/I2-4
  ...

That's basically all... :-)

The Lustre client software is in effect an extended 'autofs'/'amd', and treats the Lustre directory server as a kind of LDAP server containing a list of automount pairs; thus it creates a top-level mount point for each Lustre namespace, noting for each of those which directory server it corresponds to.

Following the example above, the Lustre client creates a top-level mount point such as '/mnt/l' with reference to 'lustre://dirserv/'; as processes access paths under the mount point it auto-"mount"s each of the underlying files, so that if a process accesses '/mnt/l/d/a' what happens is that 'lustrei://dataserv1/I1' is auto-"mounted" as '/mnt/l/d/a', and 'lustrei://dataserv[12]/I2-*' as '/mnt/l/d/b/I2-*' (and in the latter case it provides the illusion to processes that '/mnt/l/d/b' is a single file).

The good questions to ask here are:

* For a 300TB data archive, do we need a single namespace and a single storage pool? Well, it really depends. Probably not, but it is convenient.

* Is there anything better or more cost-effective than Lustre for a 300TB data archive? Likely not, unless you don't care about a single namespace or a single storage pool.

* How do you back up a 300TB archive? Well, that's a research question. I personally think that the only practical way is another 300TB archive. Other people think tape libraries can be used. Perhaps...

* What do you mean by "reliable"? It can be about loss of service or loss of data. Lustre service can be made fairly reliable by redundancy in the network and in the directory and data servers, within limits. Lustre data can be made quite reliable with redundancy in the storage subsystems of the directory and storage servers. There is unavoidable common-mode failure in the use of the same metafilesystem and filesystem code. In practice 'ext3'/'ext4' are pretty reliable and Lustre itself is pretty good too.

* What do you mean by "integrity"? It can be about detecting the loss of integrity or recovering from a loss of integrity. Detecting loss of integrity in the data must be done *at least* end-to-end. It can be done at lower levels too, but that is not sufficient. Restoring integrity can be done by detecting loss of integrity in the representation of the data (disk blocks, links, metadata, ...) and using redundancy *as a matter of convenience*. Lustre does a bit of detecting loss of integrity in the metadata, and for now no integrity restoration. In this it has overall the same integrity properties as the underlying filesystem and storage systems, minus the integrity issues of the directory system. In practice it is pretty good.
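[As a concrete, hypothetical illustration of the name-to-object mapping described above: this is roughly what querying a striped file's layout looks like from a client. The paths and numbers are made up and the exact output of lfs getstripe varies between Lustre versions:]

  # Which OST objects back /mnt/l/d/b?
  lfs getstripe /mnt/l/d/b
  # /mnt/l/d/b
  # lmm_stripe_count:  2
  # lmm_stripe_size:   1048576
  # lmm_stripe_offset: 0
  #       obdidx       objid        objid        group
  #            0       123456       0x1e240          0
  #            1       123457       0x1e241          0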
My impression is that even though the Lustre design is definitely targeted at coarse-grained data-parallel computation, not archival, it is handy to use it as a kind of single namespace and single storage pool anyhow, as there are quite few practical/low-cost alternatives and it is fairly easy to set up. But the really critical factors are the design of the data archive (the level above) and the storage/network system (the level below).
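[Finally, on the "how do you back up the MDS" question raised earlier in the thread (Kevin's periodic MDS dump, Peter's "it is extremely easy"): a minimal sketch of a device-level MDT backup along the lines of the procedure in the Lustre manual, with illustrative device and path names; the extended attributes must be saved explicitly because that is where the file-to-OST-object mapping lives:]

  # With the MDS stopped, mount the MDT device directly as ldiskfs
  # (ext3 on 1.4-era systems)
  mount -t ldiskfs /dev/mdt /mnt/mdt_backup
  cd /mnt/mdt_backup

  # Save the extended attributes (striping info), then the tree itself
  getfattr -R -d -m '.*' -e hex -P . > /root/mdt_ea.bak
  tar czf /root/mdt_backup.tgz --sparse .

  cd / && umount /mnt/mdt_backup
  # Restore is the reverse: untar onto a freshly formatted MDT, then
  # replay the attributes with: setfattr --restore=/root/mdt_ea.bak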