On Fri, 2004-05-28 at 11:11, Phil Schwan wrote:

> All of the OSTs can use one large set of shared storage, but let's be
> extremely clear -- each LUN is used by exactly one node at a time. If
> failover is necessary, we require that the "old" node be forcefully
> powered off, to ensure that it doesn't wake up and start writing to the
> partition while another node is already doing so. This would cause
> catastrophic corruption.
>
> So yes, you could absolutely connect multiple object servers to a single
> large SAN. This is common amongst our customers.
>
> Is this the answer that you were looking for?

Ok, I'm confused. I apologize if I drive you a little crazy, but indulge me as I give this another shot. Maybe I can give a scenario that explains how I "hope" to see this work. You can tear it apart and shed some light on how this works in reality.

Let us say that you have a storage array attached to a SAN. You carve/create a LUN from this array that has RAID1-mirror protection. Now, I've added three OST servers, each with one Fibre Channel HBA card. I attach them to the same SAN switch as the storage array. I tell the storage array to present this LUN to all three OST servers. I go to OST server #1 and I see this LUN via /proc/scsi/scsi. I fdisk and create one partition that covers this entire disk. Let us say it is called /dev/sdb. I mkfs and create an ext3 filesystem on this disk. I mount it under /topdir/data1.

I have very, very little knowledge of how Lustre takes this disk via its OBD layer. But since I know, from the documentation, that ext3 is supported, I know that this OST server is good to go from a storage point of view. Maybe I need not mount /dev/sdb on /topdir/data1, but I know the disk is ready, right?

I want more performance/protection/availability, so I add more OST servers. I go to OST server #2 and it too sees this same LUN via /proc/scsi/scsi. I do not have to do anything to the disk, as it was formatted by OST server #1. The same rules apply to OST server #3.

I understand that no two (or more) OST servers can "use" a LUN at the same time. So now, the above scenario will not work, right? I do not want to risk corruption, and I certainly do not want to manually intervene by powering off a dead OST server. I mean, what if it came back online and started writing to the same LUN before I had a chance to do anything? This would be unacceptable.

Instead, I have to create three separate LUNs, one for each OST server. But this defeats the purpose of sharing storage. Yes, I know a SAN allows you to share bandwidth and the storage array(s), but why can't the OST servers take it a step further and share a LUN that already has some resiliency (like a RAID1 mirror in my scenario)? Maybe I am confusing this with something like Tru64 clustering, where the nodes can read/write and even boot from the same disk.

The white paper mentioned that the data associated with any file will be striped across multiple OST servers. What does this mean? Multiple copies of the same file? Or parts of the file, like stripes (of z size), spread across the OST servers down to their disk arrays?

If multiple copies of the file are spread across the OSTs, then I would feel this is wasteful. You said this was not the case.

If parts of a file are spread across the OST servers, then how is this file pieced back together if each OST is locked to a LUN and an OST server fails? Does each OST pass file data (over TCP/QSW, since the LUN is locked per server) to the others in order to keep all files available to all OST servers?
Wouldn't this go back to essentially copying the same data to all OST servers?

Boy, I must be going in circles. Hopefully from this you can determine what concepts I am clueless on (probably all), but give me a try. Thank you.

> On Fri, 2004-05-28 at 09:40, Bill Pappas wrote:
> > Has Lustre considered working on top of a Veritas (VxFS) filesystem?
>
> No, not really. You are the first to ask.
>
> Given all of the improvements that we and others have made to ext3 over
> the last 2-3 years, our customers have not needed to look elsewhere.
>
> > Also, the white paper did not clearly explain to me how the data is
> > spread across OSTs. Page 3 (of the white paper), paragraph 1 mentions
> > that "...objects allocated on OSTs hold the data associated with the
> > file and can be striped across several OSTs in a RAID pattern." The
> > diagram on page 2 implies that each OST has DAS (direct attached
> > storage), not shared storage. The correlation (if any) is not clear to
> > me. Is the striping accomplished at the OST level with some sort of
> > software RAID? If 4 of 5 of your OSTs (assume I am using 5 OSTs for
> > availability and load balance) drop off and die, does the remaining
> > OST have a complete copy of the data for the client?
> >
> > Let me be more precise... does each DAS for each OST have a complete
> > copy of the entire filesystem?
>
> No. To be completely clear, the file data is striped across multiple
> OSTs in a RAID-0 fashion. Not RAID-1.
>
> Today, if you require redundancy (i.e., the option for one node to take
> over the services of another node), some shared storage is required.
> Many of our customers connect two or more object storage servers to each
> fibrechannel array for this purpose.
>
> If you don't require failover, any block device will do just fine.
>
> We are working on a RAID-1 object driver, which will do as you describe,
> and eliminate the need for shared storage for redundancy.
>
> > If so, can the OSTs use SAN storage and thus see the same LUNs? In
> > other words, can all the OSTs use one set of disks? Imagine Figure 1
> > with OST1 through 3 all seeing the same disk array. This would
> > leverage my storage resource more fairly, as I am using a high
> > performance storage array which is not cheap (for us, at least). I'd
> > hate to create 5 1TB disk arrays, or 5TB of space, when only 1TB is
> > needed for the filesystem.
>
> All of the OSTs can use one large set of shared storage, but let's be
> extremely clear -- each LUN is used by exactly one node at a time. If
> failover is necessary, we require that the "old" node be forcefully
> powered off, to ensure that it doesn't wake up and start writing to the
> partition while another node is already doing so. This would cause
> catastrophic corruption.
>
> So yes, you could absolutely connect multiple object servers to a single
> large SAN. This is common amongst our customers.
>
> Is this the answer that you were looking for?
>
> -Phil

--
Bill Pappas
Systems Integration Engineer
St. Jude Children's Research Hospital
Department: Hartwell Center
Phone: 901.495.4549
Fax: 901.495.2945
On Fri, 2004-05-28 at 11:49, James Dabbs wrote:

> We are evaluating Lustre for a telecomm application where two redundant
> telecomm switches access data on a cluster of 2 server nodes. That's 4
> computers -- 2 clients (the switches) and 2 servers. Client 1 and
> server 1 are in location A, and client 2 and server 2 are in location B,
> with high-capacity fiber connecting the two locations. The objective is
> an overall system resilient enough to withstand complete destruction of
> location A or location B, or failure of any one of the 4 nodes.
>
> I have been reading up on Lustre, and it seems like this is possible.
> My question now is whether this application is 'right' for Lustre, or if
> there are other, better suited technologies. Any impressions would be
> greatly appreciated.

Lustre will do this for you -- but not quite yet. To keep data synchronized between multiple servers, possibly geographically distributed, we have designed Lustre proxies. A stable, production-ready version of these proxies is planned for release in 2005, unless we receive funding to accelerate that work.

Lustre can already handle the destruction of servers, but as it today requires some kind of shared storage, it is limited to tightly-coupled clusters. Some people have been experimenting with using drbd to emulate shared storage, but I don't know how well it works, or if it's appropriate for this application.

InterMezzo, another file system we built but which is no longer maintained due to lack of funding, was designed to be very good at this, but without the strict POSIX semantics that Lustre has. Over time, Lustre's feature set will grow to include most or all of InterMezzo's.

Good luck--

-Phil
On Fri, 2004-05-28 at 09:40, Bill Pappas wrote:

> Has Lustre considered working on top of a Veritas (VxFS) filesystem?

No, not really. You are the first to ask.

Given all of the improvements that we and others have made to ext3 over the last 2-3 years, our customers have not needed to look elsewhere.

> Also, the white paper did not clearly explain to me how the data is
> spread across OSTs. Page 3 (of the white paper), paragraph 1 mentions
> that "...objects allocated on OSTs hold the data associated with the
> file and can be striped across several OSTs in a RAID pattern." The
> diagram on page 2 implies that each OST has DAS (direct attached
> storage), not shared storage. The correlation (if any) is not clear to
> me. Is the striping accomplished at the OST level with some sort of
> software RAID? If 4 of 5 of your OSTs (assume I am using 5 OSTs for
> availability and load balance) drop off and die, does the remaining OST
> have a complete copy of the data for the client?
>
> Let me be more precise... does each DAS for each OST have a complete
> copy of the entire filesystem?

No. To be completely clear, the file data is striped across multiple OSTs in a RAID-0 fashion. Not RAID-1.

Today, if you require redundancy (i.e., the option for one node to take over the services of another node), some shared storage is required. Many of our customers connect two or more object storage servers to each fibrechannel array for this purpose.

If you don't require failover, any block device will do just fine.

We are working on a RAID-1 object driver, which will do as you describe, and eliminate the need for shared storage for redundancy.

> If so, can the OSTs use SAN storage and thus see the same LUNs? In
> other words, can all the OSTs use one set of disks? Imagine Figure 1
> with OST1 through 3 all seeing the same disk array. This would leverage
> my storage resource more fairly, as I am using a high performance
> storage array which is not cheap (for us, at least). I'd hate to create
> 5 1TB disk arrays, or 5TB of space, when only 1TB is needed for the
> filesystem.

All of the OSTs can use one large set of shared storage, but let's be extremely clear -- each LUN is used by exactly one node at a time. If failover is necessary, we require that the "old" node be forcefully powered off, to ensure that it doesn't wake up and start writing to the partition while another node is already doing so. This would cause catastrophic corruption.

So yes, you could absolutely connect multiple object servers to a single large SAN. This is common amongst our customers.

Is this the answer that you were looking for?

-Phil
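To make the RAID-0 point above concrete: with striping, successive fixed-size chunks of a file land on different OSTs in round-robin order, so no single OST ever holds a whole copy. A small sketch, using plain shell arithmetic and made-up numbers (1 MB stripes across 3 OSTs; this is an illustration of the layout, not Lustre code):

    #!/bin/sh
    # RAID-0 striping illustration: consecutive stripe_size chunks of a file
    # go to OST 0, 1, 2, 0, 1, 2, ... in round-robin order.
    stripe_size=$((1024 * 1024))   # assume 1 MB stripes
    stripe_count=3                 # assume the file is striped over 3 OSTs

    offset=$((5 * 1024 * 1024))    # look at the byte 5 MB into the file
    stripe=$((offset / stripe_size))
    ost_index=$((stripe % stripe_count))
    echo "byte $offset is in stripe $stripe, stored on OST index $ost_index"
    # prints: byte 5242880 is in stripe 5, stored on OST index 2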
On Fri, 2004-05-28 at 13:45, Bill Pappas wrote:

> Let us say that you have a storage array attached to a SAN. You
> carve/create a LUN from this array that has RAID1-mirror protection.
> Now, I've added three OST servers, each with one Fibre Channel HBA card.
> I attach them to the same SAN switch as the storage array. I tell the
> storage array to present this LUN to all three OST servers. I go to OST
> server #1 and I see this LUN via /proc/scsi/scsi. I fdisk and create one
> partition that covers this entire disk. Let us say it is called
> /dev/sdb. I mkfs and create an ext3 filesystem on this disk. I mount it
> under /topdir/data1.

OK, I think I understand our miscommunication. I'm going to steal your example here, and give one of my own.

You have an enormous SAN that you want to share amongst your three object servers -- on that we agree. You want all 3 to be active (i.e., doing Lustre I/O) simultaneously, and you *also* want them to be able to take over for each other in the event that one fails. If I have any of that wrong, please correct me.

> I have very, very little knowledge of how Lustre takes this disk via its
> OBD layer. But since I know, from the documentation, that ext3 is
> supported, I know that this OST server is good to go from a storage
> point of view. Maybe I need not mount /dev/sdb on /topdir/data1, but I
> know the disk is ready, right?

That's correct. You should definitely NOT mount it yourself. Lustre will mount it privately itself, and it would be very bad to have it mounted twice.

You should also, generally, let 'lconf' format the disk for you, because it will format things a little bit differently, and enable some extra ext3 features. "lconf --reformat" will do that for you (be careful).

> I want more performance/protection/availability, so I add more OST
> servers. I go to OST server #2 and it too sees this same LUN via
> /proc/scsi/scsi. I do not have to do anything to the disk, as it was
> formatted by OST server #1. The same rules apply to OST server #3.
>
> I understand that no two (or more) OST servers can "use" a LUN at the
> same time. So now, the above scenario will not work, right? I do not
> want to risk corruption, and I certainly do not want to manually
> intervene by powering off a dead OST server. I mean, what if it came
> back online and started writing to the same LUN before I had a chance to
> do anything? This would be unacceptable.
>
> Instead, I have to create three separate LUNs, one for each OST server.
> But this defeats the purpose of sharing storage. Yes, I know a SAN
> allows you to share bandwidth and the storage array(s), but why can't
> the OST servers take it a step further and share a LUN that already has
> some resiliency (like a RAID1 mirror in my scenario)? Maybe I am
> confusing this with something like Tru64 clustering, where the nodes can
> read/write and even boot from the same disk.

Here is the confusion, I think. Whether there are 3 separate LUNs or just a single LUN with 3 partitions does not really matter to Lustre. The fact is, we need 3 completely separate block devices, one for each OST's primary storage. All 3 server nodes can see all 3 partitions, but, and this is the key: a given partition is only in use by exactly one node at a time. Period.

During the normal course of operation, each server has 1 OST, with one distinct backing file system. For the sake of simplicity, let's set up an example with one LUN visible to all 3 servers as /dev/sda.
We have 3 partitions:

  oss1 starts Lustre, and mounts /dev/sda1
  oss2 starts Lustre, and mounts /dev/sda2
  oss3 starts Lustre, and mounts /dev/sda3

During the normal course of operation, all 3 OSSs can serve data from the same shared SAN device. They don't need to know anything about each other.

Now -- oss1 catches on fire, or your junior sysadmin pours his coffee into it:

Step 1: turn off oss1. It's critical that before we mount /dev/sda1 somewhere else, we are sure that oss1 has stopped writing!

Step 2: run a small script on oss2. This script will start a new OST on that node, which mounts /dev/sda1.

Step 3: today, that script also needs to update a small piece of shared configuration (either in LDAP, or some other mechanism like an NFS store), so that clients know to find that OST on oss2 now.

Et voila. oss2 is now serving objects from /dev/sda2 and /dev/sda1, oss3 is serving objects from /dev/sda3, and oss1 is out of service.

Is this what you had in mind? Or have I still missed the point?

> The white paper mentioned that the data associated with any file will be
> striped across multiple OST servers. What does this mean? Multiple
> copies of the same file? Or parts of the file, like stripes (of z size),
> spread across the OST servers down to their disk arrays?

The latter, which is RAID-0. Different parts of the file are spread across many objects, which reside on different OSTs.

> If multiple copies of the file are spread across the OSTs, then I would
> feel this is wasteful. You said this was not the case.

This will be an option later, but not today, correct.

> If parts of a file are spread across the OST servers, then how is this
> file pieced back together if each OST is locked to a LUN and an OST
> server fails? Does each OST pass file data (over TCP/QSW, since the LUN
> is locked per server) to the others in order to keep all files available
> to all OST servers? Wouldn't this go back to essentially copying the
> same data to all OST servers?

I hope my example above cleared this up. The point is that the partitions or LUNs really are not locked to a given node, but it is totally critical that only one node at a time is using a given partition.

Hope that helps--

-Phil
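A rough sketch of what the "small script" in Step 2 and the bookkeeping in Step 3 might look like, assuming a shared NFS directory for the configuration rather than LDAP. The host names, paths, and power-control command are invented for illustration, and the lconf invocation is only an assumption about how the failed node's OST profile might be started; it is not a script shipped with Lustre:

    #!/bin/sh
    # Hypothetical failover helper, run on oss2 after oss1 has failed.
    # Every name here (hosts, paths, the power-control command) is an
    # assumption made for this example.

    FAILED=oss1
    SHARED=/mnt/shared-config      # e.g. a small NFS-exported configuration area

    # Step 1: make absolutely sure the failed node is powered off (for example
    # via a remote power switch) before its partition is touched anywhere else.
    remote-power off "$FAILED" || { echo "could not power off $FAILED" >&2; exit 1; }

    # Step 2: start the failed node's OST service on this node, so that oss2
    # now mounts /dev/sda1 in addition to its own /dev/sda2.  With the 1.x
    # tools this would be an lconf invocation selecting oss1's profile; the
    # exact arguments depend on how the configuration XML was generated.
    lconf --node "$FAILED" "$SHARED/config.xml"

    # Step 3: record the new location so clients know to find oss1's OST on
    # oss2 (with an LDAP-backed configuration this would be an ldapmodify).
    echo "$FAILED -> $(hostname) $(date)" >> "$SHARED/ost-locations"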
João Miguel Neves
2006-May-19 07:36 UTC
[Lustre-discuss] Suitability of Lustre for HA telecomm
On Sat, 2004-06-12 at 15:32, Phil Schwan wrote:

> Lustre can already handle the destruction of servers, but as it today
> requires some kind of shared storage, it is limited to tightly-coupled
> clusters. Some people have been experimenting with using drbd to
> emulate shared storage, but I don't know how well it works, or if it's
> appropriate for this application.

We've been using drbd with Lustre for a couple of months without major issues on a local gigabit ethernet network. We have one client/MDS server and 4 storage nodes, where each node has 2 machines with 8 250GB SATA disks each. Our problem is not I/O speed but the amount of storage space.

All the machines are connected to the same gigabit ethernet switch. Performance with bonnie++ is around 50MB/s for sequential reads and 23MB/s for sequential writes.

If anyone would be interested in me doing some other tests, I'd love to know.

--
João Miguel Neves
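For anyone wondering what emulating shared storage with drbd involves, here is a minimal sketch of the kind of /etc/drbd.conf resource one mirrored pair of storage machines might use, assuming drbd 0.7-era syntax; the host names, addresses, and devices are made up. Once one node has been brought up as primary, its /dev/drbd0 is handed to Lustre as an ordinary OST backing device, and the peer can take over if the primary dies:

    # Hypothetical /etc/drbd.conf excerpt (drbd 0.7-era syntax assumed).
    # nodeA and nodeB mirror one partition; only the current primary ever
    # lets Lustre touch the resulting /dev/drbd0 device.
    resource ost-r0 {
      protocol C;                    # synchronous replication
      on nodeA {
        device    /dev/drbd0;
        disk      /dev/sda1;         # local backing partition
        address   192.168.0.11:7788;
        meta-disk internal;
      }
      on nodeB {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   192.168.0.12:7788;
        meta-disk internal;
      }
    }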
Hello,

We are evaluating Lustre for a telecomm application where two redundant telecomm switches access data on a cluster of 2 server nodes. That's 4 computers -- 2 clients (the switches) and 2 servers. Client 1 and server 1 are in location A, and client 2 and server 2 are in location B, with high-capacity fiber connecting the two locations. The objective is an overall system resilient enough to withstand complete destruction of location A or location B, or failure of any one of the 4 nodes.

I have been reading up on Lustre, and it seems like this is possible. My question now is whether this application is 'right' for Lustre, or if there are other, better suited technologies. Any impressions would be greatly appreciated.

Thanks,

James Dabbs, TGA
Has Lustre considered working on top of a Veritas (VxFS) filesystem? I did not see a mention in the Lustre white paper:

  http://www.lustre.org/docs/whitepaper.pdf

Also, the white paper did not clearly explain to me how the data is spread across OSTs. Page 3 (of the white paper), paragraph 1 mentions that "...objects allocated on OSTs hold the data associated with the file and can be striped across several OSTs in a RAID pattern." The diagram on page 2 implies that each OST has DAS (direct attached storage), not shared storage. The correlation (if any) is not clear to me. Is the striping accomplished at the OST level with some sort of software RAID? If 4 of 5 of your OSTs (assume I am using 5 OSTs for availability and load balance) drop off and die, does the remaining OST have a complete copy of the data for the client?

Let me be more precise... does each DAS for each OST have a complete copy of the entire filesystem? If so, can the OSTs use SAN storage and thus see the same LUNs? In other words, can all the OSTs use one set of disks? Imagine Figure 1 with OST1 through 3 all seeing the same disk array. This would leverage my storage resource more fairly, as I am using a high performance storage array which is not cheap (for us, at least). I'd hate to create 5 1TB disk arrays, or 5TB of space, when only 1TB is needed for the filesystem.

Thanks. I hope I am clear. If not, please give me the opportunity to clarify.

--
Bill Pappas
Systems Integration Engineer
St. Jude Children's Research Hospital
Department: Hartwell Center
Phone: 901.495.4549
Fax: 901.495.2945