Hi all,

I need to prepare a small report on "NFS vs. Lustre". I could find a lot of resources about Lustre vs. CXFS, GPFS, and GFS, but little comparing Lustre directly with NFS. Can you please provide a few tips, URLs, etc.?

cheers,
__
tharindu
Lustre is a parallel filesystem; NFS is not. The advantage of NFS is that it is native to many Unix systems and widely available. The advantage of Lustre is its performance. GPFS is a parallel filesystem very similar to Lustre, but it is backed by IBM and runs on AIX and Linux. It is good, but costly. CXFS and GFS work similarly: you need a shared block device such as a SAN or a NetApp target (iSCSI). They are not really about performance; they are mostly for high availability.

What are you trying to solve? We may be able to help.

On Wed, Aug 26, 2009 at 6:11 AM, Tharindu Rukshan Bamunuarachchi <tharindub at millenniumit.com> wrote:
> hi All,
>
> I need to prepare small report on "NFS vs. Lustre".
>
> I could find lot of resources about Lustre vs. (CXFS, GPFS, GFS).
>
> Can you guys please provide few tips, URLs, etc.
> [...]
You seem to be correct. Nobody ever seems to contrast NFS with these super file system solutions. That is interesting.

It's Saturday, the family is out running around. I have time to think about this question. Unfortunately for you, I do this more for myself, which means this is going to be a stream-of-consciousness thing far more than a well-organized discussion. Sorry.

I'd begin by motivating both NFS and Lustre. Why do they exist? What problems do they solve?

NFS first.

Way back in the day, ethernet and the concept of a workstation got popular. There were many tools to copy files between machines but few ways to share a name space, that is, to have the directory hierarchy and its content directly accessible to an application on a foreign machine. This made file sharing awkward. The model was to copy the file or files to the workstation where the work was going to be done, do the work, and copy the results back to some, hopefully, well-maintained central machine.

There *were* solutions to this at the time. I recall an attractive alternative called RFS (I believe) from the Bell Labs folks, via some place in England if I'm remembering right; it's been a looong time, after all. It had issues, though. The nastiest issue for me was that if a client went down, the service side would freeze, at least partially. Since this could happen willy-nilly, depending on the user's wishes and how well the power button on his workstation was protected, together with the power cord and ethernet connection, this freezing of service for any amount of time was difficult to accept. This was so even in a rather small collection of machines.

The problem with RFS (?) and its cousins was that they were all stateful. The service side depended on state that was held at the client. If the client went down, the service side couldn't continue without a whole lot of recovery, timeouts, etc. It was a very *annoying* problem.

In the latter half of the 1980s (am I remembering right?) SUN proposed an open protocol called NFS. An implementation using this protocol could do most everything RFS (?) could, but it didn't suffer the service-side hangs. It couldn't; it was stateless. If the client went down, the server just didn't care. If the server went down, the client had the opportunity to either give up on the local operation, usually with an error returned, or wait. It was always up to the user, and for client failures the annoyance was limited to the user(s) on that client.

SUN also wisely desired the protocol to be ubiquitous. They published it. They wanted *everyone* to adopt it. More, they would help competitors: SUN held interoperability bake-a-thons to help with this.

It looks like they succeeded, all around :)

Let's sum up, then. The goals for NFS were:

1) Share a local file system name space across the network.
2) Do it in a robust, resilient way. Pesky FS issues because some user kicked the cord out of his workstation were unacceptable.
3) Make it ubiquitous. SUN was a workstation vendor. They sold servers, but almost everyone had a VAX in their back pocket where they made the infrastructure investment. SUN needed the high-value machines to support this protocol.

Now Lustre. Lustre has a weird story and I'm not going to go into all of it.
The shortest relevant part is that while there was at least one solution that DOE/NNSA felt was acceptable, GPFS, it was not available on anything other than an IBM platform, and because DOE/NNSA had a semi-formal policy of buying from different vendors at each of the three labs, we were kind of stuck. Other file systems, existing and imminent at the time, were examined, but they were all distributed file systems and we needed IO *bandwidth*. We needed lots, and lots, of bandwidth.

We also needed that ubiquitous thing that SUN had as one of their goals. We didn't want to pay millions of dollars for another GPFS. We felt that would only be painting ourselves into a corner. Whatever we did, the result *had* to be open. It also had to be attractive to smaller sites, as we wanted to turn loose of the thing at some point. If it was attractive for smaller machines, we felt we would win in the long term as, eventually, the cost to further and maintain this thing was spread across the community.

As far as technical goals, I guess we just wanted GPFS, but open. More though, we wanted it to survive in our platform roadmaps for at least a decade. The actual technical requirements for the contract that DOE/NNSA executed with HP (CFS was the sub-contractor responsible for development) can be found here:

<http://www-cs-students.stanford.edu/~trj/SGS_PathForward_SOW.pdf>

LLNL used to host this but it's no longer there? Oh well, hopefully this link will be good for a while, at least.

I'm just going to jump to the end and sum the goals up:

1) It must do *everything* NFS can. We relaxed the stateless thing, though; see the next item for why.
2) It must support full POSIX semantics: last writer wins, POSIX locks, etc.
3) It must support all of the transports we are interested in.
4) It must be scalable, in that we can cheaply attach storage and both performance (reading *and* writing) and capacity within a single mounted file system increase in direct proportion.
5) We wanted it to be easy, administratively. Our goal was that it be no harder than NFS to set up and maintain. We were involving too many folks with PhDs in the operation of our machines at the time. Before you yell FAIL, I'll say we did try. I'll also say we didn't make CFS responsible for this part of the task. Don't blame them overly much, OK?
6) We recognized we were asking for a stateful system; we wanted to mitigate that by having some focus on resiliency. These were big machines and clients died all the time.
7) While not in the SOW, we structured the contract to accomplish some future form of wide acceptance. We wanted it to be ubiquitous.

That's a lot of goals! For the technical ones, the main ones are all pretty much structured to ask two things of what became Lustre. First, give us everything NFS functionally does but go far beyond it in performance. Second, give us everything NFS functionally does but make it completely equivalent to a local file system, semantically.

There's a little more we have to consider. NFS4 is a different beast than NFS2 or NFS3. NFS{2,3} had some serious issues that became more prominent as time went by. First, security: it had none. Folks had bandaged on some different things to try to cure this, but they weren't standard across platforms. Second, it couldn't do the full POSIX-required semantics. That was attacked with the NFS lock protocols, but it was such an afterthought it will always remain problematic.
Third, new authorization possibilities, introduced by Microsoft and then POSIX and called ACLs, had no way of being supported.

NFS4 addresses those by:

1) Introducing state. It can do full POSIX now without the lock servers. Lots of resiliency mechanisms were introduced to offset the downside of this, too.
2) Formalizing and offering standardized authentication headers.
3) Introducing ACLs that map to equivalents in POSIX and Microsoft.

Strengths and Weaknesses of the Two
-----------------------------------

NFS4 does most everything Lustre can, with one very important exception: IO bandwidth.

Both seem able to deliver metadata performance at roughly the same speeds. File create, delete, and stat rates are about the same. NetApp seems to have a partial enhancement. They bought the Spinnaker goodies some time back and have deployed that technology, and redirection too(?), within their servers. The good thing about that is that two users in different directories *could* leverage two servers independently and, so, scale metadata performance. It's not guaranteed, but at least there is the possibility. If the two users are in the same directory, it's not much different, though, I'm thinking. Someone correct me if I'm wrong?

Both can offer full POSIX now. It's nasty in both cases but, yes, in theory you can export mail directory hierarchies with locking.

The NFS client and server are far easier to set up and maintain. The tools to debug issues are advanced. While the Lustre folks have done much to improve this area, NFS is just leaps and bounds ahead. It's easier to deal with NFS than Lustre. Just far, far easier, still.

NFS is just built in to everything. My TV has it, for heck's sake. Lustre is, seemingly, always an add-on. It's also a moving target. We're constantly futzing with it, upgrading, and patching. Lustre might be compilable most everywhere we care about, but building it isn't trivial. The supplied modules are great but, still, moving targets in that we wait for SUN to catch up to the vendor-supplied changes that affect Lustre. Given Lustre's size and interaction with other components in the OS, that happens far more frequently than desired. NFS just plain wins the ubiquity argument at present.

NFS IO performance does *not* scale. It's still an in-band protocol. The data is carried in the same message as the request and is, practically, limited in size. Reads are more scalable than writes: a popular file segment can be satisfied from the cache on reads, but that develops issues at some point. For writes, NFS3 and NFS4 help in that they directly support write-behind, so that a client doesn't have to wait for data to go to disk, but it's just not enough. If one streams data to/from the store, it can be larger than the cache. A client that might read a file already made "hot", but at a very different rate, just loses. A client that is writing is always looking for free memory to buffer content. Again, put too many of these together simultaneously and performance descends to the native speed of the attached back-end store, and that store can only get so big.

Lustre IO performance *does* scale. It uses a third-party transfer. Requests are made to the metadata server and IO moves directly between the affected storage component(s) and the client. The more storage components, the less possibility of contention between clients and the more data can be accepted/supplied per unit time.

NFS4 has a proposed extension, called pNFS, to address this problem. It just introduces the third-party data transfers that Lustre enjoys.
If and when that is a standard, and is well supported by clients and vendors, the really big technical difference will virtually disappear. It's been a long time coming, though. It's still not there. Will it ever be, really?

The answer to the NFS vs. Lustre question comes down to the workload for a given application, then, since they do have overlap in their solution space. If I were asked to look at a platform and recommend a solution, I would worry about IO bandwidth requirements. If the platform in question were read-mostly and, practically, never needed sustained read or write bandwidth, NFS would be an easy choice. I'd even think hard about NFS if the platform created many files but all were very small; today's filers have very respectable IOPS rates. If it came down to IO bandwidth, I'm still on the parallel file system bandwagon. NFS just can't deal with that at present, and I do still have the folks in house to manage the administrative burden.

Done. That was useful for me. I think five years ago I might have opted for Lustre in the "create many small files" case, where I would consider NFS today, so re-examining the motivations, relative strengths, and weaknesses of both was useful. As I said, I did this more as a self-exercise than anything else, but I hope you can find something useful here, too. The family is back from their errands, too :) Best wishes and good luck.

		--Lee

On Wed, 2009-08-26 at 04:11 -0600, Tharindu Rukshan Bamunuarachchi wrote:
> hi All,
>
> I need to prepare small report on "NFS vs. Lustre".
>
> I could find lot of resources about Lustre vs. (CXFS, GPFS, GFS).
>
> Can you guys please provide few tips, URLs, etc.
> [...]
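As a concrete illustration of the striping behind the bandwidth scaling Lee describes, here is a minimal sketch of an application asking Lustre to spread a new file across several OSTs before writing to it. It assumes the llapi_file_create() call from liblustreapi as described in the Lustre manual of that era; the header name, exact prototype, and the /mnt/lustre path are assumptions that vary by Lustre version and site, so treat it as illustrative only.

/* Minimal sketch: create a file striped across 4 OSTs with a 1 MiB
 * stripe size, then write to it through the normal POSIX interface.
 * Assumes llapi_file_create() from liblustreapi; header name and exact
 * prototype differ between Lustre releases, and /mnt/lustre is a
 * hypothetical mount point. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <lustre/liblustreapi.h>   /* <lustre/lustreapi.h> on newer releases */

int main(void)
{
    const char *path = "/mnt/lustre/scratch/bigfile";

    /* stripe size 1 MiB, any starting OST (-1), 4 stripes, default pattern */
    int rc = llapi_file_create(path, 1 << 20, -1, 4, 0);
    if (rc != 0) {
        fprintf(stderr, "llapi_file_create: %s\n", strerror(-rc));
        return EXIT_FAILURE;
    }

    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Large sequential writes from here on are split across the 4 OSTs,
     * so aggregate bandwidth grows with the number of storage targets. */
    static char buf[1 << 20];
    memset(buf, 0, sizeof(buf));
    for (int i = 0; i < 64; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            close(fd);
            return EXIT_FAILURE;
        }
    }
    close(fd);
    return EXIT_SUCCESS;
}

In practice the same layout is usually set administratively with lfs setstripe on a directory, so applications need no Lustre-specific code at all.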
Lee,

Thanks for posting this. I found the background and perspective very interesting.

John

John K. Dawson
jkdawson at gmail.com
612-860-2388

On Aug 29, 2009, at 12:56 PM, Lee Ward wrote:
> You seem to be correct. Nobody ever seems to contrast NFS with these
> super file system solutions. That is interesting.
> [...]
Well said. This should be on the Wiki :-)

On Sat, Aug 29, 2009 at 2:15 PM, John K. Dawson <jkdawson at gmail.com> wrote:
> Lee,
>
> Thanks for posting this. I found the background and perspective very
> interesting.
>
> John
> [...]
Hi!

On Sat, Aug 29, 2009 at 11:56:40AM -0600, Lee Ward wrote:
> NFS4 addresses those by:
>
> 1) Introducing state. It can do full POSIX now without the lock servers.
> Lots of resiliency mechanisms were introduced to offset the downside of
> this, too.

NFS4 implementations are able to handle POSIX advisory locks, but unlike Lustre, they don't support full POSIX filesystem semantics. For example, NFS4 still follows the traditional NFS close-to-open cache consistency model, whereas with Lustre, individual write()s are atomic and become immediately visible to all clients.

Regards,

Daniel.
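For readers unfamiliar with what "POSIX advisory locks" means in practice, here is a small generic sketch (not taken from the thread, file path hypothetical) of the fcntl() byte-range locking that both NFSv4 and Lustre can enforce across clients.

/* Generic sketch of a POSIX advisory byte-range lock with fcntl().
 * Both NFSv4 and Lustre can honor such locks across clients; the path
 * below is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/shared/mailbox", O_RDWR);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive write lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 = lock the whole file */
    };

    /* F_SETLKW blocks until the lock is granted. */
    if (fcntl(fd, F_SETLKW, &fl) == -1) {
        perror("fcntl(F_SETLKW)");
        close(fd);
        return EXIT_FAILURE;
    }

    /* ... read/modify/write the locked region here ... */

    fl.l_type = F_UNLCK;       /* release the lock */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return EXIT_SUCCESS;
}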
On Sun, Aug 30, 2009 at 10:51:41PM +0200, Daniel Kobras wrote:
> NFS4 implementations are able to handle POSIX advisory locks, but unlike
> Lustre, they don't support full POSIX filesystem semantics. For example,
> NFS4 still follows the traditional NFS close-to-open cache consistency
> model, whereas with Lustre, individual write()s are atomic and become
> immediately visible to all clients.

NFSv4 can't handle O_APPEND, and has those close-to-open semantics. Those are the two large departures from POSIX in NFSv4.

NFSv4.1 also adds metadata/data separation and data distribution, much like Lustre, but with the same POSIX semantics departures mentioned above. Also, NFSv4.1's "pNFS" concept doesn't have room for "capabilities" (in the distributed filesystem sense, not in the Linux capabilities sense), which means that OSSes and MDSes have to communicate to get permissions to be enforced. There are also differences with respect to recovery, etcetera.

One thing about NFS is that it's meant to be neutral w.r.t. the type of filesystem it shares. So NFSv4, for example, has features for dealing with filesystems that don't have a notion of persistent inode number. Whereas Lustre has its own on-disk format and therefore can't be used to share just any type of filesystem.

Nico
--
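To make the O_APPEND point concrete, here is a small generic sketch (file name hypothetical) of the append-mode logging pattern. On a local filesystem or Lustre, each O_APPEND write lands atomically at the current end of file; the NFS protocol has no atomic append, so clients emulate it and concurrent appenders can interleave or clobber each other's records.

/* Generic sketch: several processes appending records to a shared log.
 * With O_APPEND, a local filesystem or Lustre places each write
 * atomically at the current end of file.  NFS clients only emulate
 * append (look up the size, then write at that offset), so concurrent
 * appenders over NFS can overwrite each other.  The path is
 * hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int log_line(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    char buf[256];
    int len = snprintf(buf, sizeof(buf), "%ld: %s\n", (long)getpid(), msg);
    ssize_t n = write(fd, buf, (size_t)len);   /* atomic append locally/Lustre */

    close(fd);
    return (n == len) ? 0 : -1;
}

int main(void)
{
    return log_line("/mnt/shared/app.log", "job finished") == 0 ? 0 : 1;
}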
On Sun, 2009-08-30 at 16:12 -0500, Nicolas Williams wrote:
> One thing about NFS is that it's meant to be neutral w.r.t. the type of
> filesystem it shares. So NFSv4, for example, has features for dealing
> with filesystems that don't have a notion of persistent inode number.
> Whereas Lustre has its own on-disk format and therefore can't be used to
> share just any type of filesystem.

You have "stumbled on to" an interesting, significant difference between NFS and Lustre. NFS is a protocol for sharing an existing filesystem. Lustre is a filesystem -- so much so, in fact, that NFS can even share it out.

b.
On Sun, Aug 30, 2009 at 08:16:52PM -0400, Brian J. Murrell wrote:
> You have "stumbled on to" an interesting, significant difference between
> NFS and Lustre. NFS is a protocol for sharing an existing filesystem.
> Lustre is a filesystem -- so much so, in fact, that NFS can even share it
> out.

Indeed. pNFS is not really a protocol for sharing generic, pre-existing filesystems anymore either. The moment you want to distribute the filesystem itself, you can no longer just substitute any filesystem into an implementation of the protocol.

(Yes, I understand that when Lustre was layered above the VFS one could conceivably have changed the underlying fs, though that didn't work out, if only for practical reasons. But even then, one couldn't have used the underlying fs directly, not in a meaningful way.)

Nico
--
Interesting discussion of NFS vs. Lustre, even if they are so different in aims...

[ ... ]

lee> 3) It must support all of the transports we are interested in.

Except for some corner cases (which an HEP site might well have), today that tends to reduce to the classic Ethernet/IP pair...

lee> 4) It must be scalable, in that we can cheaply attach
lee> storage and both performance (reading *and* writing) and
lee> capacity within a single mounted file system increase in
lee> direct proportion.

I suspect that scalability is more of a dream, as to me it involves more requirements, including scalable backup (not so easy) and scalable 'fsck' (not so easy). These are easier with Lustre because it does not provide "a single mounted file system" but a single mounted *namespace*, which is a very different thing, even if largely equivalent for most users.

[ ... ]

lee> NFS4 does most everything Lustre can with one very
lee> important exception, IO bandwidth. [ ... ] Lustre IO
lee> performance *does* scale. It uses a third-party transfer.

That can be summarized by saying that Lustre is a parallel distributed metafilesystem, while NFS is a protocol used to access what usually is something not distributed and an actual filesystem. The limitations of the NFS protocol can be overcome and, as you say, pNFS turns it into a parallel distributed metafilesystem too:

lee> NFS4 has a proposed extension, called pNFS, to address this
lee> problem. It just introduces the third-party data transfers
lee> that Lustre enjoys. If and when that is a standard, and is
lee> well supported by clients and vendors, the really big
lee> technical difference will virtually disappear. It's been a
lee> long time coming, though. It's still not there. Will it
lee> ever be, really?

My impression is that it is a lot more real than it was only a couple of years ago, and here is an amusing mashup:

  http://FT.ORNL.gov/pubs-archive/ipdps2009-wyu-final.pdf

  "Parallel NFS (pNFS) is an emergent open standard for parallelizing
  data transfer over a variety of I/O protocols. Prototypes of pNFS are
  actively being developed by industry and academia to examine its
  viability and possible enhancements. In this paper, we present the
  design, implementation, and evaluation of lpNFS, a Lustre-based
  parallel NFS. [ ... ] Our initial performance evaluation shows that
  the performance of lpNFS is comparable to that of original Lustre."

lee> Done. That was useful for me. I think five years ago I
lee> might have opted for Lustre in the "create many small
lee> files" case, where I would consider NFS today,

Looks optimistic to me -- I don't see any good solution to the "create many small files" case, at least as to shared storage. For smaller situations I am looking, out of interest, at some other distributed filesystems, which are a bit more researchy but seem fairly reliable already.
Hi!

On Sun, Aug 30, 2009 at 04:12:11PM -0500, Nicolas Williams wrote:
> NFSv4 can't handle O_APPEND, and has those close-to-open semantics.
> Those are the two large departures from POSIX in NFSv4.

Along these lines, it's probably worth mentioning commit-on-close as well, an area where NFS (v3 and v4, optionally relaxed when using write delegations) is more strict than POSIX. This is to make sure that NFS still has the possibility to notify the user about errors when trying to save their data.

Lustre's standard config follows POSIX and allows dirty client-side caches after close(). Performance improves as a result, of course, but in case something goes wrong on the net or the server, users potentially lose data just like on any local POSIX filesystem. The difference is that users tend to notice when their local machine crashes. It's much easier to miss a remote server or a switch going down, and hence suffer from silent data loss. (Admins will typically notice, e.g. via eviction messages in the logs, but have a hard time telling which files had been affected.) The solution is to fsync() all valuable data on a POSIX filesystem, but that's not necessarily within reach for an average end user.

Regards,

Daniel.
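A generic sketch of the fsync() pattern Daniel refers to: force data to stable storage and check the return codes of fsync() and close() so write-back errors are not silently dropped. The path and payload are hypothetical.

/* Generic sketch: write data and push it to stable storage before
 * declaring success.  Checking fsync() and close() return values is
 * what surfaces write-back errors that would otherwise be silent.
 * The path and payload are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int save_result(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    while (len > 0) {                       /* handle short writes */
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (fsync(fd) != 0) {                   /* push dirty data to stable storage */
        close(fd);
        return -1;
    }
    return close(fd);                       /* close() can report errors too */
}

int main(void)
{
    const char msg[] = "valuable result\n";
    if (save_result("/mnt/lustre/results.txt", msg, sizeof(msg) - 1) != 0) {
        perror("save_result");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}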
On Mon, 2009-08-31 at 21:56 +0200, Daniel Kobras wrote:
> Hi!

Hi,

> Lustre's standard config follows POSIX and allows dirty client-side
> caches after close(). Performance improves as a result, of course, but
> in case something goes wrong on the net or the server, users potentially
> lose data just like on any local POSIX filesystem.

I don't think this is true. This is something that I am only peripherally knowledgeable about, and I am sure somebody like Andreas or Johann can correct me if/where I go wrong...

You are right that there is an opportunity for a client to write to an OST and get its write(2) call returned before data goes to physical disk. But Lustre clients know that, and therefore they keep the state needed to replay that write(2) to the server until the server sends back a commit callback. The commit callback is what tells the client that the data actually went to physical media and that it can now purge any state required to replay that transaction.

Until that commit callback is received, the client holds on to whatever state it would need to do that write(2) all over again, for exactly the case you cite, which is the server going down before the data goes to physical media.

It is this data that the client is caching until the commit callback is received that is used by the recovery mechanisms that start when a target comes back on-line.

Hope that clarifies things, and further, I hope my understanding is correct, as is my explanation.

b.
Brian J. Murrell wrote:
>> Lustre's standard config follows POSIX and allows dirty client-side
>> caches after close(). Performance improves as a result, of course, but
>> in case something goes wrong on the net or the server, users potentially
>> lose data just like on any local POSIX filesystem.

Yes, this is the case on server failure, but I think the true similarity between Lustre and a locally mounted filesystem lies in the failure of a client holding dirty pages. Please correct me if I'm wrong, but data loss will occur should the client fail after close() but prior to the set of dirty pages being committed on the OST.

paul

> I don't think this is true. This is something that I am only
> peripherally knowledgeable about, and I am sure somebody like Andreas or
> Johann can correct me if/where I go wrong...
> [...]
Yes, the semantics are similar to a local filesystem: data can be lost after close() in several cases, including:

1) the client crashes before the server has written the data to disk (data that made it to the server should be written, but that is asynchronous),
2) the server returns an error to the client (EIO, e.g. due to errors on the OST),
3) the client is evicted by the server (e.g. due to communication issues) before writing data to disk, or
4) the server reboots and recovery fails (e.g., in 1.6.x, a _different_ client does not reconnect to replay transactions).

With version-based recovery in 1.8, clients might still be able to replay some transactions even if another client crashed/rebooted while the server was down.

fsync() is the best way to ensure data is on disk, for both Lustre and a local filesystem.

Kevin

Brian J. Murrell wrote:
> You are right that there is an opportunity for a client to write to an
> OST and get its write(2) call returned before data goes to physical
> disk. But Lustre clients know that, and therefore they keep the state
> needed to replay that write(2) to the server until the server sends back
> a commit callback.
> [...]
On Mon, Aug 31, 2009 at 04:50:02PM -0400, Paul Nowoczynski wrote:
> Yes this is the case on server failure but I think the true similarity
> between lustre and a locally mounted filesystem lies in the failure of a
> client holding dirty pages. [...]

The client will have DLM locks outstanding if it has dirty data, so the
client's death can be used to detect that its open, dirty files are now
potentially corrupted.

Client death with dirty data is not all that different from process death
with dirty data in user-land. Think of an application that does write(2),
write(2), close(2), _exit(2), but dies between the writes. Compare that to a
client that dies after flushing the first of those writes but before flushing
the second, though after the application has called close(2). Nothing special
is usually done in the first case, even though, if the process did have byte
range locks outstanding, the OS could flag the affected file as potentially
corrupted. I don't think Lustre actually does anything to mark the files
that it could detect as potentially corrupted.

Some applications can recover automatically -- think of databases such as
SQLite3, or of plain log files. Other applications might well be affected.
Since corruption detection in this case is heuristic, and since the impact
will vary by application, I don't think there's an easy answer as to what
Lustre ought to do about it. Ideally we could track the "potentially corrupt"
status as an advisory meta-data item that could be fetched with a
stat(2)-like system call, and have applications reset it when they recover.

Nico
Nicolas Williams wrote:
> Client death with dirty data is not all that different from process
> death with dirty data in user-land. Think of an application that does
> write(2), write(2), close(2), _exit(2), but dies between writes.

It is very different; with a user application crash, all the writes issued up
to that point will be completed by the kernel. With a node crash, there are
no guarantees about what made it to disk since the last fsync() -- a later
write may be partly flushed before an earlier write, so the node could crash
after the second write made it to disk but before the first one does. But
this is also true for a local filesystem.

Kevin
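The ordering point matters for applications in which one write depends on another. As a rough sketch (the file layout and path are invented, and none of this is Lustre-specific), an fsync() between the two writes is what guarantees the dependent record is on disk before the header that advertises it:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical data file: an 8-byte record count, then records. */
        int fd = open("/mnt/lustre/records.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Step 1: write the new record after the header. */
        const char record[] = "record payload";
        if (pwrite(fd, record, sizeof(record), 8) < 0) { perror("pwrite"); return 1; }

        /* Without this fsync(), a node crash could leave the updated header
         * on disk while the record it points to never made it. */
        if (fsync(fd) != 0) { perror("fsync"); return 1; }

        /* Step 2: only now publish the record by updating the header. */
        uint64_t nrecords = 1;
        if (pwrite(fd, &nrecords, sizeof(nrecords), 0) < 0) { perror("pwrite"); return 1; }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }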
On Mon, Aug 31, 2009 at 03:50:33PM -0600, Kevin Van Maren wrote:
> Nicolas Williams wrote:
> > Client death with dirty data is not all that different from process
> > death with dirty data in user-land. [...]
>
> It is very different; with a user application crash, all the writes to
> that point will be completed by the kernel.

But not the writes that the application hadn't made yet and was still working
on putting together at the time it died. If those never-made writes are
related to writes that did get made, then you may have a problem.

For example, consider an RDBMS. Say you begin a transaction, do some
INSERTs/UPDATEs/DELETEs, then COMMIT. This will almost certainly require
multiple write(2)s (even for a DB that uses COW principles). Now suppose that
somewhere in the middle the process doing the writes dies. There should be
some undo/redo log somewhere, and on restart the RDBMS must recover from the
partially finished transaction.
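The undo/redo log Nicolas mentions follows the same discipline: record what you intend to do, make the log record durable, and only then touch the data. The sketch below is a bare-bones illustration of that write-ahead pattern; the file names and "log format" are invented, and a real database does far more (checksums, torn-write detection, replay on startup):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Append a buffer to a file and force it to disk before returning. */
    static int durable_append(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }

    int main(void)
    {
        /* 1. Log the intent and make the log entry durable first. */
        const char intent[] = "BEGIN; UPDATE balance SET value=42; COMMIT;\n";
        if (durable_append("journal.log", intent, sizeof(intent) - 1) != 0) {
            perror("journal");
            return 1;
        }

        /* 2. Only then apply the change to the data file.  If the process or
         *    the node dies here, restart code can redo the logged change. */
        const char newdata[] = "balance=42\n";
        if (durable_append("data.db", newdata, sizeof(newdata) - 1) != 0) {
            perror("data");
            return 1;
        }

        /* 3. A real system would now mark the journal entry as applied. */
        return 0;
    }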
Hi!

On Mon, Aug 31, 2009 at 04:34:58PM -0400, Brian J. Murrell wrote:
> You are right that there is an opportunity for a client to write to an
> OST and get its write(2) call returned before data goes to physical
> disk.  But Lustre clients know that, and therefore they keep the state
> needed to replay that write(2) to the server until the server sends back
> a commit callback. [...]

Lustre can recover from certain error conditions just fine, of course, but it
still cannot recover gracefully from others. Think of double failures or,
more likely, connectivity problems to a subset of hosts. For instance, if an
Ethernet switch goes down for a few minutes while IB stays available, all
Ethernet-connected clients will get evicted. Users won't necessarily notice
that there was a problem, but they have just potentially lost data. VBR
(version-based recovery) makes data loss less likely in this case, but the
possibility is still there. I'd suspect you'll always be able to construct
similar corner cases as long as the networked filesystem allows dirty caches
after close().

Regards,

Daniel.
Greetings all,

I'd like to throw in my 2c as well. I'm not a Lustre dev, just a sysadmin who
manages a small (<10TB) Lustre data store. For some background, we're using
it in a web hosting environment for a particularly large set of websites.
We're also considering it for use as a storage backend to a cluster of VPS
servers.

Our "default" choice for new clusters is usually NFS, for many of the reasons
mentioned already: pretty good read performance, good use of client- and
server-side caching with no extra work, and above all it's *extremely* simple
to maintain. You can install 2 machines with completely stock Linux distros,
and the odds are both of them will support being an NFS server *and* client,
and will talk to each other with only minimal effort.

Our problems with NFS: occasionally we need better locking support than NFS
delivers. Often capacity scalability is a concern (if you planned for it, you
can grow the NFS-exported volume to some extent). Scaling out to many clients
(frontend web servers, in our case) is sometimes a problem, although
realistically we just don't need that many frontends very often.

The downside to Lustre is the complexity. Initial setup is much simpler than,
say, Red Hat GFS or OCFS2, but still *vastly* more complicated than NFS, due
in large part to the ubiquity of NFS. If NFS breaks (and it rarely does for
us), the fix is usually pretty simple. If Lustre breaks... well, let's just
say I don't like being the guy on call. It could be worse, but it's no
picnic. We've had a lot more downtime with our *redundant* Lustre cluster
than we ever did with the standalone NFS servers it replaced.

Documentation-wise, a lot of NFS documentation is extremely dated, and what
used to be good advice often isn't anymore. My personal opinion is that the
Red Hat GFS documentation is an utter disaster: it looks great from 50,000
feet but is nigh-impossible to implement without much head-bashing. You may
have found that really nice article in Red Hat Magazine about NFS vs. GFS
scalability. Looks cool, doesn't it? We tried that and gave up a week later
when we just couldn't make it stable -- yes, we could make it work, but it
would have been a *constant* headache. Lustre, on the other hand, has pretty
good documentation. The admin guide is beefy and detailed, and has a lot of
good info. Some of it feels dated (1.6 vs. 1.8), but all in all I'm happy
with it.

Redundancy is a problem: you can sorta do HA-NFS, but it's not particularly
pretty, and it's not conveniently active-active. Lustre has some redundancy
abilities, although none of them are what I'd call "native". To me, native
failover redundancy would mean Lustre handles the data
migration/synchronization and the actual failover. Lustre supports multiple
targets for the same data and will try each of them if one isn't working...
but it's up to *you* to make sure the data is actually *available* in both
places. We use DRBD for this, and Heartbeat to handle the failover. It mostly
works, but I'm not really happy with it. Still, it's no worse than what NFS
offers, and sometimes better.

You can easily do a LOT of disk space on one server if needed. I've seen a
25TB array on one server (Dell MD1000s + Windows!), and *heard* of as much as
67TB on one server (not NFS though). I really don't know how well NFS handles
arrays that size, but it should at least function. Of course, with Lustre,
you can still do that much on one server, *plus* more servers with that much
too.

There's also staffing to consider.
Being so much simpler, NFS wins here because you don't need as highly trained
staff to deal with it. NFS probably costs less from a personnel standpoint:
Lustre admins are rarer, and therefore probably command higher salaries, and
it's not obvious that you would need fewer of them. At some point a manager
will have to decide whether the technological benefits of Lustre outweigh the
extra staffing costs to maintain it (if there actually are any such costs).

All in all, neither is really ideal, and they have different strengths. If
you need to be 24/7 and not a lot of your staff is going to have time to
become proficient with a complicated storage subsystem like Lustre, you're
probably better off with NFS. If you really need better scalability or
POSIX-ness, and can stand the administrative overhead, Lustre works.

I guess the proof is in the pudding: we're not planning on migrating en masse
from NFS to Lustre. We're sticking with NFS as our default choice, at least
for the time being.

Happy sysadmin-ing,
Jake

On Wed, Aug 26, 2009 at 3:11 AM, Tharindu Rukshan Bamunuarachchi
<tharindub at millenniumit.com> wrote:
> [...]