On Fri, 2004-05-28 at 11:11, Phil Schwan wrote:

> All of the OSTs can use one large set of shared storage, but let's be
> extremely clear -- each LUN is used by exactly one node at a time. If
> failover is necessary, we require that the "old" node be forcefully
> powered off, to ensure that it doesn't wake up and start writing to the
> partition while another node is already doing so. This would cause
> catastrophic corruption.
>
> So yes, you could absolutely connect multiple object servers to a single
> large SAN. This is common amongst our customers.
>
> Is this the answer that you were looking for?

Ok, I'm confused. I apologize if I drive you a little crazy, but indulge me as I give this another shot. Maybe I can give a scenario that explains how I "hope" to see this work. You can tear it apart and shed some light on how this works in reality.

Let us say that you have a storage array attached to a SAN. You carve/create a LUN from this array that has RAID1-mirror protection. Now, I've added three OST servers, each with one Fibre Channel HBA card. I attach them to the same SAN switch as the storage array. I tell the storage array to present this LUN to all three OST servers. I go to OST server #1 and I see this LUN via /proc/scsi/scsi. I fdisk and create one partition that covers this entire disk. Let us say it is called /dev/sdb. I mkfs and create an ext3 filesystem on this disk. I mount it under /topdir/data1.

I have very, very little knowledge of how Lustre takes this disk via its OBD layer. But since I know, from the documentation, that ext3 is supported, I know that this OST server is good to go from a storage point of view. Maybe I need not mount /dev/sdb on /topdir/data1, but I know the disk is ready, right?

I want more performance/protection/availability, so I add more OST servers. I go to OST server #2 and it too sees this same LUN via /proc/scsi/scsi. I do not have to do anything to the disk, as it was formatted by OST server #1. The same rules apply to OST server #3.

I understand that no two (or more) OST servers can "use" a LUN at the same time. So now, the above scenario will not work, right? I do not want to risk corruption, and I certainly do not want to manually intervene by powering off a dead OST server. I mean, what if it came back online and started writing to the same LUN before I had a chance to do anything? This would be unacceptable.

Instead, I have to create three separate LUNs, one for each OST server. But this defeats the purpose of sharing storage. Yes, I know a SAN allows you to share bandwidth and the storage array(s), but why can't the OST servers take it a step further and share a LUN that already has some resiliency (like a RAID1 mirror in my scenario)? Maybe I am confusing this with something like Tru64 clustering, where the nodes can read/write and even boot from the same disk.

The white paper mentioned that the data associated with any file will be striped across multiple OST servers. What does this mean? Multiple copies of the same file? Or parts of the file, like stripes (of z size), spread across the OST servers down to their disk arrays?

If multiple copies of the file are spread across the OSTs, then I would feel this is wasteful. You said this was not the case.

If parts of a file are spread across the OST servers, then how is this file pieced back together if each OST is locked to a LUN and an OST server fails? Does each OST pass file data (over TCP/QSW, since the LUN is locked per server) to the others in order to keep all files available to all OST servers?
Wouldn't this go back to essentially copying the same data to all OST servers?

Boy, I must be going in circles. Hopefully from this you can determine what concepts I am clueless on (probably all), but give me a try. Thank you.

> On Fri, 2004-05-28 at 09:40, Bill Pappas wrote:
> > Has Lustre considered working on top of a Veritas (VxFS) filesystem?
>
> No, not really. You are the first to ask.
>
> Given all of the improvements that we and others have made to ext3 over
> the last 2-3 years, our customers have not needed to look elsewhere.
>
> > Also, the white paper did not clearly explain to me how the data is
> > spread across OSTs. Page 3 (of the white paper), paragraph 1 mentions
> > that "...objects allocated on OSTs hold the data associated with the
> > file and can be striped across several OSTs in a RAID pattern." The
> > diagram on page 2 implies that each OST has DAS (direct attached
> > storage), not shared storage. The correlation (if any) is not clear to
> > me. Is the striping accomplished at the OST level with some sort of
> > software RAID? If 4 of 5 of your OSTs (assume I am using 5 OSTs for
> > availability and load balance) drop off and die, does the remaining
> > OST have a complete copy of the data for the client?
> >
> > Let me be more precise... does each DAS for each OST have a complete
> > copy of the entire filesystem?
>
> No. To be completely clear, the file data is striped across multiple
> OSTs in a RAID-0 fashion. Not RAID-1.
>
> Today, if you require redundancy (i.e., the option for one node to take
> over the services of another node), some shared storage is required.
> Many of our customers connect two or more object storage servers to each
> fibrechannel array for this purpose.
>
> If you don't require failover, any block device will do just fine.
>
> We are working on a RAID-1 object driver, which will do as you describe,
> and eliminate the need for shared storage for redundancy.
>
> > If so, can the OSTs use SAN storage and thus see the same LUNs? In
> > other words, can all the OSTs use one set of disks? Imagine Figure 1
> > with OST1 through 3 all seeing the same disk array. This would
> > leverage my storage resource more fairly, as I am using a high
> > performance storage array which is not cheap (for us, at least). I'd
> > hate to create 5 1TB disk arrays, or 5TB of space, when only 1TB is
> > needed for the filesystem.
>
> All of the OSTs can use one large set of shared storage, but let's be
> extremely clear -- each LUN is used by exactly one node at a time. If
> failover is necessary, we require that the "old" node be forcefully
> powered off, to ensure that it doesn't wake up and start writing to the
> partition while another node is already doing so. This would cause
> catastrophic corruption.
>
> So yes, you could absolutely connect multiple object servers to a single
> large SAN. This is common amongst our customers.
>
> Is this the answer that you were looking for?
>
> -Phil

--
Bill Pappas
Systems Integration Engineer
St. Jude Children's Research Hospital
Department: Hartwell Center
Phone: 901.495.4549
Fax: 901.495.2945
On Fri, 2004-05-28 at 11:49, James Dabbs wrote:

> We are evaluating Lustre for a telecomm application where two redundant
> telecomm switches access data on a cluster of 2 server nodes. That's 4
> computers -- 2 clients (the switches) and 2 servers. Client 1 and
> server 1 are in location A, and client 2 and server 2 are in location B,
> with high-capacity fiber connecting the two locations. The objective is
> an overall system resilient enough to withstand complete destruction of
> location A or location B, or failure of any one of the 4 nodes.
>
> I have been reading up on Lustre, and it seems like this is possible.
> My question now is whether this application is 'right' for Lustre, or if
> there are other, better suited technologies. Any impressions would be
> greatly appreciated.

Lustre will do this for you -- but not quite yet. To keep data synchronized between multiple servers, possibly geographically distributed, we have designed Lustre proxies. A stable, production-ready version of these proxies is planned for release in 2005, unless we receive funding to accelerate that work.

Lustre can already handle the destruction of servers, but as it today requires some kind of shared storage, it is limited to tightly-coupled clusters. Some people have been experimenting with using drbd to emulate shared storage, but I don't know how well it works, or if it's appropriate for this application.

InterMezzo, another file system we built but which is no longer maintained due to lack of funding, was designed to be very good at this, but without the strict POSIX semantics that Lustre has. Over time, Lustre's feature set will grow to include most or all of InterMezzo's.

Good luck--

-Phil
On Fri, 2004-05-28 at 09:40, Bill Pappas wrote:

> Has Lustre considered working on top of a Veritas (VxFS) filesystem?

No, not really. You are the first to ask.

Given all of the improvements that we and others have made to ext3 over the last 2-3 years, our customers have not needed to look elsewhere.

> Also, the white paper did not clearly explain to me how the data is
> spread across OSTs. Page 3 (of the white paper), paragraph 1 mentions
> that "...objects allocated on OSTs hold the data associated with the
> file and can be striped across several OSTs in a RAID pattern." The
> diagram on page 2 implies that each OST has DAS (direct attached
> storage), not shared storage. The correlation (if any) is not clear to
> me. Is the striping accomplished at the OST level with some sort of
> software RAID? If 4 of 5 of your OSTs (assume I am using 5 OSTs for
> availability and load balance) drop off and die, does the remaining OST
> have a complete copy of the data for the client?
>
> Let me be more precise... does each DAS for each OST have a complete
> copy of the entire filesystem?

No. To be completely clear, the file data is striped across multiple OSTs in a RAID-0 fashion. Not RAID-1.

Today, if you require redundancy (i.e., the option for one node to take over the services of another node), some shared storage is required. Many of our customers connect two or more object storage servers to each fibrechannel array for this purpose.

If you don't require failover, any block device will do just fine.

We are working on a RAID-1 object driver, which will do as you describe, and eliminate the need for shared storage for redundancy.

> If so, can the OSTs use SAN storage and thus see the same LUNs? In
> other words, can all the OSTs use one set of disks? Imagine Figure 1
> with OST1 through 3 all seeing the same disk array. This would leverage
> my storage resource more fairly, as I am using a high performance
> storage array which is not cheap (for us, at least). I'd hate to create
> 5 1TB disk arrays, or 5TB of space, when only 1TB is needed for the
> filesystem.

All of the OSTs can use one large set of shared storage, but let's be extremely clear -- each LUN is used by exactly one node at a time. If failover is necessary, we require that the "old" node be forcefully powered off, to ensure that it doesn't wake up and start writing to the partition while another node is already doing so. This would cause catastrophic corruption.

So yes, you could absolutely connect multiple object servers to a single large SAN. This is common amongst our customers.

Is this the answer that you were looking for?

-Phil
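To make the RAID-0 point above concrete: with striping, successive fixed-size chunks of a file land on different OSTs in round-robin order, so no single OST ever holds a whole copy. A small sketch, using plain shell arithmetic and made-up numbers (1 MB stripes across 3 OSTs; this is an illustration of the layout, not Lustre code):

    #!/bin/sh
    # RAID-0 striping illustration: consecutive stripe_size chunks of a file
    # go to OST 0, 1, 2, 0, 1, 2, ... in round-robin order.
    stripe_size=$((1024 * 1024))   # assume 1 MB stripes
    stripe_count=3                 # assume the file is striped over 3 OSTs

    offset=$((5 * 1024 * 1024))    # look at the byte 5 MB into the file
    stripe=$((offset / stripe_size))
    ost_index=$((stripe % stripe_count))
    echo "byte $offset is in stripe $stripe, stored on OST index $ost_index"
    # prints: byte 5242880 is in stripe 5, stored on OST index 2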
On Fri, 2004-05-28 at 13:45, Bill Pappas wrote:

> Let us say that you have a storage array attached to a SAN. You
> carve/create a LUN from this array that has RAID1-mirror protection.
> Now, I've added three OST servers, each with one Fibre Channel HBA card.
> I attach them to the same SAN switch as the storage array. I tell the
> storage array to present this LUN to all three OST servers. I go to OST
> server #1 and I see this LUN via /proc/scsi/scsi. I fdisk and create one
> partition that covers this entire disk. Let us say it is called
> /dev/sdb. I mkfs and create an ext3 filesystem on this disk. I mount it
> under /topdir/data1.

OK, I think I understand our miscommunication. I'm going to steal your example here, and give one of my own.

You have an enormous SAN that you want to share amongst your three object servers -- on that we agree. You want all 3 to be active (i.e., doing Lustre I/O) simultaneously, and you *also* want them to be able to take over for each other in the event that one fails. If I have any of that wrong, please correct me.

> I have very, very little knowledge of how Lustre takes this disk via its
> OBD layer. But since I know, from the documentation, that ext3 is
> supported, I know that this OST server is good to go from a storage
> point of view. Maybe I need not mount /dev/sdb on /topdir/data1, but I
> know the disk is ready, right?

That's correct. You should definitely NOT mount it yourself. Lustre will mount it privately itself, and it would be very bad to have it mounted twice.

You should also, generally, let 'lconf' format the disk for you, because it will format things a little bit differently, and enable some extra ext3 features. "lconf --reformat" will do that for you (be careful).

> I want more performance/protection/availability, so I add more OST
> servers. I go to OST server #2 and it too sees this same LUN via
> /proc/scsi/scsi. I do not have to do anything to the disk, as it was
> formatted by OST server #1. The same rules apply to OST server #3.
>
> I understand that no two (or more) OST servers can "use" a LUN at the
> same time. So now, the above scenario will not work, right? I do not
> want to risk corruption, and I certainly do not want to manually
> intervene by powering off a dead OST server. I mean, what if it came
> back online and started writing to the same LUN before I had a chance to
> do anything? This would be unacceptable.
>
> Instead, I have to create three separate LUNs, one for each OST server.
> But this defeats the purpose of sharing storage. Yes, I know a SAN
> allows you to share bandwidth and the storage array(s), but why can't
> the OST servers take it a step further and share a LUN that already has
> some resiliency (like a RAID1 mirror in my scenario)? Maybe I am
> confusing this with something like Tru64 clustering, where the nodes can
> read/write and even boot from the same disk.

Here is the confusion, I think. Whether there are 3 separate LUNs or just a single LUN with 3 partitions does not really matter to Lustre. The fact is, we need 3 completely separate block devices, one for each OST's primary storage. All 3 server nodes can see all 3 partitions, but, and this is the key: a given partition is only in use by exactly one node at a time. Period.

During the normal course of operation, each server has 1 OST, with one distinct backing file system. For the sake of simplicity, let's set up an example with one LUN visible to all 3 servers as /dev/sda.
We have 3 partitions:

  oss1 starts Lustre, and mounts /dev/sda1
  oss2 starts Lustre, and mounts /dev/sda2
  oss3 starts Lustre, and mounts /dev/sda3

During the normal course of operation, all 3 OSSs can serve data from the same shared SAN device. They don't need to know anything about each other.

Now -- oss1 catches on fire, or your junior sysadmin pours his coffee into it:

Step 1: turn off oss1. It's critical that before we mount /dev/sda1 somewhere else, we are sure that oss1 has stopped writing!

Step 2: run a small script on oss2. This script will start a new OST on that node, which mounts /dev/sda1.

Step 3: today, that script also needs to update a small piece of shared configuration (either in LDAP, or some other mechanism like an NFS store), so that clients know to find that OST on oss2 now.

Et voila. oss2 is now serving objects from /dev/sda2 and /dev/sda1, oss3 is serving objects from /dev/sda3, and oss1 is out of service.

Is this what you had in mind? Or have I still missed the point?

> The white paper mentioned that the data associated with any file will be
> striped across multiple OST servers. What does this mean? Multiple
> copies of the same file? Or parts of the file, like stripes (of z size),
> spread across the OST servers down to their disk arrays?

The latter, which is RAID-0. Different parts of the file are spread across many objects, which reside on different OSTs.

> If multiple copies of the file are spread across the OSTs, then I would
> feel this is wasteful. You said this was not the case.

This will be an option later, but not today, correct.

> If parts of a file are spread across the OST servers, then how is this
> file pieced back together if each OST is locked to a LUN and an OST
> server fails? Does each OST pass file data (over TCP/QSW, since the LUN
> is locked per server) to the others in order to keep all files available
> to all OST servers? Wouldn't this go back to essentially copying the
> same data to all OST servers?

I hope my example above cleared this up. The point is that the partitions or LUNs really are not locked to a given node, but it is totally critical that only one node at a time is using a given partition.

Hope that helps--

-Phil
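A rough sketch of what the "small script" in Step 2 and the bookkeeping in Step 3 might look like, assuming a shared NFS directory for the configuration rather than LDAP. The host names, paths, and power-control command are invented for illustration, and the lconf invocation is only an assumption about how the failed node's OST profile might be started; it is not a script shipped with Lustre:

    #!/bin/sh
    # Hypothetical failover helper, run on oss2 after oss1 has failed.
    # Every name here (hosts, paths, the power-control command) is an
    # assumption made for this example.

    FAILED=oss1
    SHARED=/mnt/shared-config      # e.g. a small NFS-exported configuration area

    # Step 1: make absolutely sure the failed node is powered off (for example
    # via a remote power switch) before its partition is touched anywhere else.
    remote-power off "$FAILED" || { echo "could not power off $FAILED" >&2; exit 1; }

    # Step 2: start the failed node's OST service on this node, so that oss2
    # now mounts /dev/sda1 in addition to its own /dev/sda2.  With the 1.x
    # tools this would be an lconf invocation selecting oss1's profile; the
    # exact arguments depend on how the configuration XML was generated.
    lconf --node "$FAILED" "$SHARED/config.xml"

    # Step 3: record the new location so clients know to find oss1's OST on
    # oss2 (with an LDAP-backed configuration this would be an ldapmodify).
    echo "$FAILED -> $(hostname) $(date)" >> "$SHARED/ost-locations"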
João Miguel Neves
2006-May-19 07:36 UTC
[Lustre-discuss] Suitability of Lustre for HA telecomm
On Sat, 2004-06-12 at 15:32, Phil Schwan wrote:

> Lustre can already handle the destruction of servers, but as it today
> requires some kind of shared storage, it is limited to tightly-coupled
> clusters. Some people have been experimenting with using drbd to
> emulate shared storage, but I don't know how well it works, or if it's
> appropriate for this application.

We've been using drbd with Lustre for a couple of months without major issues on a local gigabit ethernet network. We have one client/MDS server and 4 storage nodes, where each node has 2 machines with 8 250GB SATA disks each. Our problem is not I/O speed but the amount of storage space.

All the machines are connected to the same gigabit ethernet switch. Performance with bonnie++ is around 50MB/s for sequential reads and 23MB/s for sequential writes.

If anyone would be interested in me doing some other tests, I'd love to know.

--
João Miguel Neves
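For anyone wondering what emulating shared storage with drbd involves, here is a minimal sketch of the kind of /etc/drbd.conf resource one mirrored pair of storage machines might use, assuming drbd 0.7-era syntax; the host names, addresses, and devices are made up. Once one node has been brought up as primary, its /dev/drbd0 is handed to Lustre as an ordinary OST backing device, and the peer can take over if the primary dies:

    # Hypothetical /etc/drbd.conf excerpt (drbd 0.7-era syntax assumed).
    # nodeA and nodeB mirror one partition; only the current primary ever
    # lets Lustre touch the resulting /dev/drbd0 device.
    resource ost-r0 {
      protocol C;                    # synchronous replication
      on nodeA {
        device    /dev/drbd0;
        disk      /dev/sda1;         # local backing partition
        address   192.168.0.11:7788;
        meta-disk internal;
      }
      on nodeB {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   192.168.0.12:7788;
        meta-disk internal;
      }
    }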
Hello,

We are evaluating Lustre for a telecomm application where two redundant telecomm switches access data on a cluster of 2 server nodes. That's 4 computers -- 2 clients (the switches) and 2 servers. Client 1 and server 1 are in location A, and client 2 and server 2 are in location B, with high-capacity fiber connecting the two locations. The objective is an overall system resilient enough to withstand complete destruction of location A or location B, or failure of any one of the 4 nodes.

I have been reading up on Lustre, and it seems like this is possible. My question now is whether this application is 'right' for Lustre, or if there are other, better suited technologies. Any impressions would be greatly appreciated.

Thanks,

James Dabbs, TGA
Has Lustre considered working on top of a Veritas (VxFS) filesystem? I did not see a mention in the Lustre white paper:

  http://www.lustre.org/docs/whitepaper.pdf

Also, the white paper did not clearly explain to me how the data is spread across OSTs. Page 3 (of the white paper), paragraph 1 mentions that "...objects allocated on OSTs hold the data associated with the file and can be striped across several OSTs in a RAID pattern." The diagram on page 2 implies that each OST has DAS (direct attached storage), not shared storage. The correlation (if any) is not clear to me. Is the striping accomplished at the OST level with some sort of software RAID? If 4 of 5 of your OSTs (assume I am using 5 OSTs for availability and load balance) drop off and die, does the remaining OST have a complete copy of the data for the client?

Let me be more precise... does each DAS for each OST have a complete copy of the entire filesystem? If so, can the OSTs use SAN storage and thus see the same LUNs? In other words, can all the OSTs use one set of disks? Imagine Figure 1 with OST1 through 3 all seeing the same disk array. This would leverage my storage resource more fairly, as I am using a high performance storage array which is not cheap (for us, at least). I'd hate to create 5 1TB disk arrays, or 5TB of space, when only 1TB is needed for the filesystem.

Thanks. I hope I am clear. If not, please give me the opportunity to clarify.

--
Bill Pappas
Systems Integration Engineer
St. Jude Children's Research Hospital
Department: Hartwell Center
Phone: 901.495.4549
Fax: 901.495.2945