Heya all,

I'm investigating potential solutions for a storage deployment. Lustre piqued my interest due to its ease of scalability and impressive aggregate throughput potential.

I'm wondering if there's any provision in Lustre for handling the catastrophic loss of a node containing an OST; e.g., replication/mirroring of OSTs to other nodes?

I'm gathering from the 1.8.0 documentation that there's no protection of this sort for data other than underlying RAID configs on any individual node, at least not without attempting to do some interesting stuff with DRBD. I just started looking at Lustre over the past day though, so I'd totally appreciate an authoritative answer in case I'm misinterpreting the documentation. :)

Thanks,

-- 
Gary Gogick
senior systems administrator | workhabit, inc.
Gary Gogick wrote:
> I'm gathering from the 1.8.0 documentation that there's no protection
> of this sort for data other than underlying RAID configs on any
> individual node, at least not without attempting to do some
> interesting stuff with DRBD.

Correct.

Lustre failover can be used to survive the catastrophic failure of a _node_, but not of the _storage_. If your configuration makes LUNs available to two nodes, it is possible to configure Lustre to operate across the failure of a server.

If your LUN fails catastrophically, all the data on that LUN is gone. It is possible to bring Lustre up without it, but none of the files on that OST would be available. If you are concerned about this case, then backups are your friend.

While DRBD could be used to make a LUN "available" to two nodes, it will have a significant impact on performance, and (AFAIK) it does not do synchronous replication, so an fsck would be required prior to mounting the OST on the second node, and there would be some data loss.

Kevin
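For concreteness, a minimal sketch of the shared-LUN failover setup Kevin describes. Hostnames, NIDs, and device paths here are hypothetical, and the options assume 1.8-era mkfs.lustre; verify against the manual for your version:

    # On oss1 -- format a shared LUN as an OST, declaring oss2 as the
    # failover peer:
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 \
        --failnode=oss2@tcp0 /dev/mapper/shared_lun0

    # Mount it on the primary server; if oss1 fails, Heartbeat (or an
    # admin) mounts the same LUN on oss2 instead:
    mount -t lustre /dev/mapper/shared_lun0 /mnt/ost0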
Okay, that's what I feared; glad to have it confirmed. Thanks Kevin, appreciate the quick response. :)

-- 
Gary Gogick
senior systems administrator | workhabit, inc.

On Fri, Jun 19, 2009 at 2:15 PM, Kevin Van Maren <Kevin.Vanmaren at sun.com> wrote:
> Lustre failover can be used to survive the catastrophic failure of a
> _node_, but not of the _storage_. [...]
Dr. Hung-Sheng Tsao (LaoTsao)
2009-Jun-19 18:55 UTC
[Lustre-discuss] OST redundancy between nodes?
It seems that one can use a dual-connected array on two nodes and use software mirroring between the two arrays to give you double protection:

1) HW RAID within each array
2) SW RAID between the arrays

hth

Kevin Van Maren wrote:
> If your LUN fails catastrophically, all the data on that LUN is gone.
> It is possible to bring Lustre up without it, but none of the files on
> that OST would be available. [...]
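A sketch of the arrangement LaoTsao suggests: a host-side RAID1 across two hardware-RAID arrays, which then becomes the OST device. Device names are hypothetical:

    # Mirror two LUNs, one from each dual-connected array; each LUN is
    # already protected internally by its array's hardware RAID:
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/mapper/arrayA_lun0 /dev/mapper/arrayB_lun0

    # The resulting mirror, not the raw LUNs, is what gets formatted
    # as the OST:
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 /dev/md0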
Will definitely have to keep that in mind as a possibility for the future. Our initial storage systems aren't HW-RAID capable, unfortunately. Really love the ease of adding more storage (and performance) with Lustre; might just have to refactor into it down the road. Thanks for the idea, though!

-- 
Gary Gogick
senior systems administrator | workhabit, inc.

On Fri, Jun 19, 2009 at 2:55 PM, Dr. Hung-Sheng Tsao (LaoTsao) <Hung-Sheng.Tsao at sun.com> wrote:
> It seems that one can use a dual-connected array on two nodes and use
> software mirroring between the two arrays to give you double protection [...]
On Fri, Jun 19, 2009 at 1:15 PM, Kevin Van Maren <Kevin.Vanmaren at sun.com> wrote:
> Gary Gogick wrote:
>> Wondering if there's any provision in Lustre for handling catastrophic
>> loss of a node containing an OST; e.g. replication/mirroring of OSTs to
>> other nodes?

I am confused about this. Will the files on that OST be unavailable, or will some of the files in that filesystem be unavailable?

My impression is that Lustre stripes file data across many OSTs in terms of objects. So wouldn't the failure of one OST potentially corrupt the files which have stripes/objects stored on that OST?

Please correct me if I am wrong.

- CS.
On Thu, 2009-06-25 at 10:21 -0500, Carlos Santana wrote:
> I am confused about this. Will the files on that OST be unavailable, or
> will some of the files in that filesystem be unavailable?

Both. An OST contains objects. For singly-striped files (the default), a single object is the entire file (data). So losing an OST means losing the object, which means losing the file (contents).

> My impression is that Lustre stripes file data across many OSTs
> in terms of objects.

It *may*. By default it does not.

> So wouldn't the failure of one OST potentially corrupt the files which
> have stripes/objects stored on that OST?

Yes. This is the other side of the "both" I mentioned above.

b.
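To make Brian's point about striping concrete, a quick sketch using the standard lfs tool. The mount point and file names are hypothetical, and the flags assume a 1.8-era lfs:

    # Show the stripe layout of an existing file; stripe_count 1 means
    # the whole file lives in a single object on a single OST:
    lfs getstripe /mnt/testfs/somefile

    # Create a file striped across 4 OSTs; losing any one of those
    # OSTs would then lose part of this file's data:
    lfs setstripe -c 4 /mnt/testfs/widefile

    # A stripe count of -1 means "stripe over all available OSTs":
    lfs setstripe -c -1 /mnt/testfs/verywidefile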
Thanks Brian.

I was wondering what will happen during an OST failure:
- if a client is making some read/write operation
- if a client requests a read/write after the OST fails

When I made an OSS unavailable, the client waited/got a delayed response until the OSS connected back. I am not sure about OST failure though. Any clues?

- CS.

On Thu, Jun 25, 2009 at 10:34 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> Both. An OST contains objects. For singly-striped files (the default),
> a single object is the entire file (data). So losing an OST means
> losing the object, which means losing the file (contents). [...]
On Fri, 2009-06-26 at 10:56 -0500, Carlos Santana wrote:
> I was wondering what will happen during an OST failure:
> - if a client is making some read/write operation

Assuming the OST is configured for failover, the client will retry anything that didn't get committed to disk before the OST failure. It will try with all available failover targets for the OST.

> - if a client requests a read/write after the OST fails

Same as above.

> When I made an OSS unavailable, the client waited/got a delayed response
> until the OSS connected back.

Right. That's failover.

> I am not sure about OST failure though. Any clues?

An OST fails if an OSS fails, given that an OST is the disk in an OSS (which is the node).

b.
Sorry, but maybe I am confused between OSS and OST.

On Fri, Jun 26, 2009 at 11:24 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> Assuming the OST is configured for failover, the client will retry
> anything that didn't get committed to disk before the OST failure. It
> will try with all available failover targets for the OST.

Can an OST (disk) be configured for failover like an OSS (server node)?

> An OST fails if an OSS fails, given that an OST is the disk in an OSS
> (which is the node).

I thought an OST (disk) could fail without the OSS (server) having failed. And that's my question: what will happen in such a scenario, both while a client is in a read/write operation and when a client requests a read/write after the OST (disk) failure?

~ CS.
The OSS is the server. It normally provides one or more OSTs.

OST failover is done by configuring multiple OSS nodes to be able to serve the same OST. Only ONE OSS node may provide the OST at a time. Failover is accomplished by the clients attempting to connect to each OSS node configured to serve the OST, until one of them responds with it active.

An OST can be moved back and forth between OSS nodes with umount/mount commands (assuming both servers can access the same disk!).

If an OST "fails", meaning that the underlying HW has failed (or the connection to the storage has failed -- one reason to use multipath IO), then Lustre will return IO errors to the application (although there is an RFE to not do that). Normally what happens is the OSS _node_ fails, and the other node mounts the OST (typically done by using Linux-HA/Heartbeat).

MDS/MDT failover/configuration is similar.

Kevin

Carlos Santana wrote:
> I thought an OST (disk) could fail without the OSS (server) having
> failed. And that's my question: what will happen in such a scenario [...]
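A sketch of the manual back-and-forth Kevin describes, assuming oss1 and oss2 can both see the same shared device (names and paths are hypothetical):

    # On oss1 -- stop serving the OST:
    umount /mnt/ost0

    # On oss2 -- take over the same shared device; clients that were
    # blocked on this OST reconnect here and resume their operations:
    mount -t lustre /dev/mapper/shared_lun0 /mnt/ost0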
On Fri, 2009-06-26 at 11:51 -0600, Kevin Van Maren wrote:
> If an OST "fails", meaning that the underlying HW has failed (or the
> connection to the storage has failed -- one reason to use multipath IO),
> then Lustre will return IO errors to the application (although there is
> an RFE to not do that).

This is not entirely true. It is only true when an OST is configured as "failout". When an OST is configured as failover however (which is the typical case), the application just blocks until the OST can be put back into service again on any of the defined failover nodes for that OST and the client can reconnect. At that time, pending operations are resumed and the application continues.

> Normally what happens is the OSS _node_ fails,
> and the other node mounts the OST (typically done by using
> Linux-HA/Heartbeat).

Right. And no applications see any errors while this happens.

And it is worth noting that defining an OST for failover does not require that more than one OSS be defined for it. You can provide "failover service" (i.e. no EIOs to clients) using a single OSS. If it dies, then clients just block until it can be repaired.

b.
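For reference, a sketch of how the two behaviors Brian distinguishes are selected. I believe the 1.8-era knob is the failover.mode target parameter, but treat the exact syntax as an assumption to verify against the manual:

    # Default behavior is "failover" semantics: clients block and retry.
    # To get EIO-to-the-application ("failout") semantics instead, mark
    # the target accordingly (run against the unmounted OST device;
    # the device path here is made up):
    tunefs.lustre --param="failover.mode=failout" /dev/mapper/shared_lun0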
On Fri, Jun 26, 2009 at 12:51 PM, Kevin Van Maren <Kevin.Vanmaren at sun.com> wrote:
> The OSS is the server. It normally provides one or more OSTs.
>
> OST failover is done by configuring multiple OSS nodes to be able to serve
> the same OST. Only ONE OSS node may provide the OST at a time.

I understand that an OST can't be shared by two or more active OSSs at a time, but that we can/should configure OSSs for failover mode. In my interpretation, an OST failure was a disk/storage failure, so the failover you are referring to was an OSS failover in my understanding (i.e., switch to another failover OSS node if a particular OSS fails).

> If an OST "fails", meaning that the underlying HW has failed (or the
> connection to the storage has failed -- one reason to use multipath IO),
> then Lustre will return IO errors to the application (although there is an
> RFE to not do that). Normally what happens is the OSS _node_ fails, and the
> other node mounts the OST (typically done by using Linux-HA/Heartbeat).

Yeah, this is what I am curious about: OST/disk/storage-device failure.

It might be nice to have something on the wiki treating the server and the target as separate entities versus the same machine. I have gone through the FAQ entry, but it would be great if we could elaborate on it further.
On Fri, 2009-06-26 at 13:15 -0500, Carlos Santana wrote:
> Yeah, this is what I am curious about: OST/disk/storage-device failure.

If the media (i.e. physical disk) that is an OST fails, then there is nothing Lustre can do to recover it. This is why we strongly suggest OSTs be some form of RAID. Lustre absolutely assumes that the storage is reliable and adds no additional redundancy to/for OSTs or the MDT.

The bottom line: your data is only as safe as the disks (virtual or physical) that you give to Lustre.

b.
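A sketch of the kind of host-level RAID Brian recommends when no hardware RAID is available, as in Gary's case. Device names are hypothetical:

    # Build a RAID6 set out of plain disks, so any two disks can fail
    # without losing the OST:
    mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]

    # Format the array, not the raw disks, as the OST:
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 /dev/md0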
On Fri, Jun 26, 2009 at 1:21 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> If the media (i.e. physical disk) that is an OST fails, then there is
> nothing Lustre can do to recover it. This is why we strongly suggest
> OSTs be some form of RAID. Lustre absolutely assumes that the storage
> is reliable and adds no additional redundancy to/for OSTs or the MDT.

Yeah, this was answered by Kevin at the beginning of this thread. My question was what message/error will be given to the client.

Also, I did not understand why the OSS-failure and OST-failure terms were used interchangeably. The 'failover' term seems appropriate when talking about servers, not targets.

~ Thanks,
CS.
On Fri, 2009-06-26 at 13:32 -0500, Carlos Santana wrote:
> Yeah, this was answered by Kevin at the beginning of this thread. My
> question was what message/error will be given to the client.

If the media corrupts, typically ldiskfs will see that and set the device read-only, returning read-only errors to the client. If the disk outright dies and just fails to respond at all to requests from the OSS, IIRC, the OSS will just keep trying, and the client will end up timing out and will start to look for an OSS (i.e. among the failover nodes) that will respond for that OST.

> Also, I did not understand why the OSS-failure and OST-failure terms
> were used interchangeably.

They are not, really. OST failure is so catastrophic that most people go to great lengths to avoid it, so it's not considered as often as OSS failure.

> The 'failover' term seems appropriate when talking about servers, not
> targets.

No, it's quite related to targets, but not in the sense that the disk itself dies (see above about the lengths people go to to avoid this), since Lustre can't do anything about that anyway. Failover is configured at the target level, not the server level.

b.
Just to clarify one more point: failover is designed to handle temporary issues with the server. It is NOT designed to handle problems with either the storage or the network: Lustre assumes neither will have problems. More below.

On Jun 26, 2009, at 12:09 PM, "Brian J. Murrell" <Brian.Murrell at Sun.COM> wrote:
> This is not entirely true. It is only true when an OST is configured as
> "failout". When an OST is configured as failover however (which is the
> typical case), the application just blocks until the OST can be put back
> into service again on any of the defined failover nodes for that OST and
> the client can reconnect. At that time, pending operations are resumed
> and the application continues.

If the client connection to the server is lost, then yes. But I was referring to the storage returning an IO error to the server; when that happens, the server returns IO errors to the client, which are then passed to the application. The request to not forward those errors is in bugzilla -- basically, give heartbeat a chance to do a failover if the path to storage is lost on the server.

> And it is worth noting that defining an OST for failover does not
> require that more than one OSS be defined for it. You can provide
> "failover service" (i.e. no EIOs to clients) using a single OSS. If it
> dies, then clients just block until it can be repaired.

Right, that lets you reboot the server semi-transparently (you still get the delay/hang on the filesystem). But it does not handle the server getting IO errors from the storage.

Kevin
Comments in-line.

- CS.

On Fri, Jun 26, 2009 at 1:09 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> This is not entirely true. It is only true when an OST is configured as
> "failout". When an OST is configured as failover however (which is the
> typical case), the application just blocks until the OST can be put back
> into service again on any of the defined failover nodes for that OST and
> the client can reconnect. At that time, pending operations are resumed
> and the application continues.

The application does not block for all commands. For example, lfs df works, and so does new file creation (if you have another OST running). However, querying disk space with df, or running ls, will fail. And this fails even after deactivating the OST on the MDS.
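For anyone reproducing Carlos's test, a sketch of the deactivation step he mentions. The device index is an example; find the real one with lctl dl on the MDS:

    # On the MDS -- list configured devices to find the OSC entry for
    # the failed OST, then deactivate it so the MDS stops allocating
    # new objects there:
    lctl dl
    lctl --device 7 deactivate

    # On a client, lfs df should now show that OST as inactive:
    lfs df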