Robert Minvielle
2009-Jan-30 21:35 UTC
[Lustre-discuss] lustre no longer allows reads/writes (stopped working)?
I have set up a Lustre system for testing, consisting of four OSTs and one MDT. It seems to work fine for about a day. At the end of about 24 hours, the clients can no longer read or write the mount point (although a file listing (ls) works). For example, a mkdir yields "cannot create directory '/datafs/temp': Identifier removed", and the temp dir does not exist. A file listing of the /datafs directory comes back complete and correct, but if I try to ls a subdirectory it gives me the error "ls: /datafs/test2: Identifier removed".

The client is mounting the dir to /datafs. This worked fine earlier; I left for the day, came back in, and this error is occurring on all clients (albeit I only have three clients for testing). All clients/servers are running RHEL5, and Lustre was installed via RPMs as per the manual.

Out of curiosity, if I go to the server and do an ls on /mnt/data/mdt, or to the OST server and do an ls on /mnt/data/ost1, I get an error that it is not a directory (although that could be normal, I am not sure). A cat of /proc/fs/lustre/devices on the MDT does not show anything out of place (or at least, it is the same as when I started Lustre and mounted the servers/clients).

I have configured it all according to
http://manual.lustre.org/manual/LustreManual16_HTML/ConfiguringLustreExamples.html#50548848_pgfId-1286919
as per section 6.1.1.2 Configuration Generation and Application, using one server for the MGS and MDS, and I have four OSTs, just like the example.

Has anyone seen this before?

Robert
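For reference, the configuration in that manual section comes down to commands along the following lines. The device names, NID, and mount points here are placeholders to illustrate the layout, not necessarily what was actually run:

  # combined MGS/MDT node
  mkfs.lustre --fsname=datafs --mgs --mdt /dev/sda1
  mount -t lustre /dev/sda1 /mnt/data/mdt

  # on each of the four OST servers (ost1 ... ost4)
  mkfs.lustre --fsname=datafs --ost --mgsnode=mdsnode@tcp0 /dev/sdb1
  mount -t lustre /dev/sdb1 /mnt/data/ost1

  # on each client
  mount -t lustre mdsnode@tcp0:/datafs /datafs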
Oleg Drokin
2009-Jan-30 22:20 UTC
[Lustre-discuss] lustre no longer allows reads/writes (stopped working)?
Hello!

On Jan 30, 2009, at 4:35 PM, Robert Minvielle wrote:
> I have setup a lustre system for testing consisting of four OSTs and one
> MDT. It seems to work fine for about a day. At the end of about 24 hours,
> the clients can no longer read or write the mount point (although a file
> listing (ls) works). For example, a mkdir yields a "cannot create directory
> '/datafs/temp': Identifier removed", and the temp dir does not exist.
> A file listing of the /datafs directory comes back complete and correct,
> but if I try to ls a subdirectory it gives me the error "ls: /datafs/test2:
> Identifier removed".

That means your /etc/group file is out of sync between the clients and the MDS, and you have a group upcall configured. Making them identical is the simplest way to fix it, or refer to bug 14756:
https://bugzilla.lustre.org/show_bug.cgi?id=14756
for more details.

Bye,
Oleg
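A rough sketch of how to check for that condition, assuming a Lustre 1.6-style MDS and a hypothetical target name datafs-MDT0000 (the exact proc path and target name depend on the setup):

  # on the MDS: see whether a group upcall is configured
  cat /proc/fs/lustre/mds/datafs-MDT0000/group_upcall

  # compare the group file between the MDS and a client
  md5sum /etc/group
  ssh client1 md5sum /etc/group

  # one workaround, in addition to syncing the files: disable the upcall on the MDS
  echo NONE > /proc/fs/lustre/mds/datafs-MDT0000/group_upcall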
Jeremy Mann
2009-Jan-30 22:41 UTC
[Lustre-discuss] lustre no longer allows reads/writes (stopped working)?
Robert Minvielle wrote:
> I have setup a lustre system for testing consisting of four OSTs and one
> MDT. It seems to work fine for about a day. At the end of about 24 hours,
> the clients can no longer read or write the mount point (although a file
> listing (ls) works). For example, a mkdir yields a "cannot create directory
> '/datafs/temp': Identifier removed", and the temp dir does not exist.
> A file listing of the /datafs directory comes back complete and correct,
> but if I try to ls a subdirectory it gives me the error "ls: /datafs/test2:
> Identifier removed".

Robert, this happens when your MDT node does not have the same groups file as the OSTs.

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
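Following on from that, a minimal way to bring every node back in sync is to push one authoritative copy of the group file everywhere; the hostnames below are hypothetical:

  # run from the node whose /etc/group is authoritative
  for host in mds1 oss1 oss2 oss3 oss4 client1 client2 client3; do
      scp /etc/group root@$host:/etc/group
  done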
Robert Minvielle
2009-Jan-30 22:46 UTC
[Lustre-discuss] lustre no longer allows reads/writes (stopped working)?
Aha. I was searching for the wrong thing on bugzilla. I will correct and retest. Thank you.

----- Original Message -----
From: "Oleg Drokin" <Oleg.Drokin at Sun.COM>
To: "Robert Minvielle" <robert at lite3d.com>
Cc: lustre-discuss at lists.lustre.org
Sent: Friday, January 30, 2009 4:20:01 PM GMT -06:00 US/Canada Central
Subject: Re: [Lustre-discuss] lustre no longer allows reads/writes (stopped working)?

Hello!

On Jan 30, 2009, at 4:35 PM, Robert Minvielle wrote:
> I have setup a lustre system for testing consisting of four OSTs and one
> MDT. It seems to work fine for about a day. At the end of about 24 hours,
> the clients can no longer read or write the mount point (although a file
> listing (ls) works). For example, a mkdir yields a "cannot create directory
> '/datafs/temp': Identifier removed", and the temp dir does not exist.
> A file listing of the /datafs directory comes back complete and correct,
> but if I try to ls a subdirectory it gives me the error "ls: /datafs/test2:
> Identifier removed".

That means your /etc/group file is out of sync between the clients and the MDS, and you have a group upcall configured. Making them identical is the simplest way to fix it, or refer to bug 14756:
https://bugzilla.lustre.org/show_bug.cgi?id=14756
for more details.

Bye,
Oleg
Arden Wiebe
2009-Jan-30 23:15 UTC
[Lustre-discuss] lustre no longer allows reads/writes (stopped working)?
> I have setup a lustre system for testing consisting of four OSTs and one
> MDT. It seems to work fine for about a day. At the end of about 24 hours,
> the clients can no longer read or write the mount point (although a file
> listing (ls) works).

That is the problem. Your clients are mounting wrong. You have used incorrect formatting of the nodes.

> For example, a mkdir yields a "cannot create directory
> '/datafs/temp': Identifier removed", and the temp dir does not exist.
> A file listing of the /datafs directory comes back complete and correct,
> but if I try to ls a subdirectory it gives me the error "ls: /datafs/test2:
> Identifier removed".

Please review, via your bash history, the exact commands you used to make the underlying filesystem. Be certain everything is pointing to the correct filesystem and to the correct directories.

> The client is mounting the dir to /datafs. This worked fine earlier, I left
> for the day, came back in and this error is occurring on all clients (albeit
> I only have three clients for testing). All clients/servers are running
> RHEL5, and the lustre was installed via rpms as per the manual.

If you followed the manual 100% (it takes practice), the client should be mounting your combined MDT/MGS node at the MDT/MGS node's IP address via your network (for example, tcp0) on a local mountpoint such as /mnt/datafs. I found it helps to change the manual's example filesystem name to something other than datafs, testfs or spfs; in your case I would recommend the name litefs. There are also some ambiguities with slashes in the examples and, I might add, with use or misuse of the = sign after fsname. By far the best example is further into the manual, about mounting external journals. Also, from everything I have read, it is best to have the MGS and MDT separate. Otherwise you must have two mount points on your combined MDT/MGS node, /mnt/mgs and /mnt/data/mdt.

> Out of curiosity, if I go to the server and do an ls on /mnt/data/mdt or
> to the OST server and do an ls on /mnt/data/ost1, I get an error that
> it is not a directory (although that could be normal, I am not sure).

Yes, that is normal, because those are mount points, not directories.

> A cat of /proc/fs/lustre/devices on the mdt does not show anything out of place
> (or at least, it is the same as when I started the lustre and mounted
> the servers/clients)

So we assume your combined MDT/MGS is up and running, but is it formatted properly and mounted properly?

> I have configured it all according to
> http://manual.lustre.org/manual/LustreManual16_HTML/ConfiguringLustreExamples.html#50548848_pgfId-1286919
> as per section 6.1.1.2 Configuration Generation and Application, using one server
> for the MGS and MDS, and I have four OSTs, just like the example.
>
> Has anyone seen this before?

Yes, and it is common until you get good enough at creating your Lustre filesystem to know which formatting and mounting procedures interact to make a live filesystem that you adopt and know to be sound.

Robert, to simplify things I'll include some of my .bash_history on the nodes for you to examine. This should considerably decrease your initial configuration timeframe. My configuration differs in that I opt for separate MGS and MDT. This first block is from the MGS.
umount /mnt/mgs
mdadm -S /dev/md2
mdadm -S /dev/md1
mdadm -S /dev/md0
mdadm --zero-superblock /dev/sdb
mdadm --zero-superblock /dev/sdc
mdadm --zero-superblock /dev/sdd
mdadm --zero-superblock /dev/sde
mdadm --zero-superblock /dev/sdf
mdadm -v --create --assume-clean /dev/md0 --level=raid10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
sfdisk -uC /dev/sdf << EOF
mke2fs -b 4096 -O journal_dev /dev/sdf1
cat /proc/mdstat
mkfs.lustre --mgs --fsname=ioio --mkfsoptions="-J device=/dev/sdf1" --reformat /dev/md0
rm /etc/mdadm.conf
mdadm --detail --scan --verbose > /etc/mdadm.conf
mount -t lustre /dev/md0 /mnt/mgs
e2label /dev/md0
vi /etc/fstab
e2label /dev/md0
cat /proc/mdstat
mount -t lustre 192.168.0.7@tcp0:/ioio /mnt/ioio
lctl dl
lfs df -h

This shows a single MGS with an external journal on /dev/sdf1. The MGS is mounted on /mnt/mgs from the /dev/md0 device. Its e2label will be MGS, which goes in /etc/fstab as LABEL=MGS followed by the mount options (example fstab entries are sketched at the end of this message). Here you can see I connect a client to the MGS to test the filesystem, but only after the MDT is mounted and the OSSs are mounted.

On the MDT:

umount /mnt/data/mdt
mdadm -S /dev/md2
mdadm -S /dev/md0
mdadm -S /dev/md1
mdadm --zero-superblock /dev/sdb
mdadm --zero-superblock /dev/sdc
mdadm --zero-superblock /dev/sdd
mdadm --zero-superblock /dev/sde
mdadm --zero-superblock /dev/sdf
mdadm -v --create --assume-clean /dev/md0 --level=raid10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
sfdisk -uC /dev/sdf << EOF
mke2fs -b 4096 -O journal_dev /dev/sdf1
cat /proc/mdstat
mkfs.lustre --mdt --fsname=ioio --mgsnode=192.168.0.7@tcp0 --mkfsoptions="-J device=/dev/sdf1" --reformat /dev/md0
mount -t lustre /dev/md0 /mnt/data/mdt
rm /etc/mdadm.conf
mdadm --detail --scan --verbose > /etc/mdadm.conf
e2label /dev/md0
vi /etc/fstab
shutdown -r now

When this MDT comes back online, your filesystem will be mounted correctly, as identified by lctl dl.

And a typical OST (choose whatever RAID level you require):

umount /mnt/data/ost0
cat /proc/mdstat
mdadm -S /dev/md0
mdadm --zero-superblock /dev/sdb
mdadm --zero-superblock /dev/sdc
mdadm --zero-superblock /dev/sdd
mdadm --zero-superblock /dev/sde
mdadm --zero-superblock /dev/sdf
mdadm --zero-superblock /dev/sdg
mdadm --zero-superblock /dev/sdh
mdadm -v --create --assume-clean /dev/md0 --level=raid10 --raid-devices=6 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
cat /proc/mdstat
sfdisk -uC /dev/sdb << EOF
mke2fs -b 4096 -O journal_dev /dev/sdb1
mkfs.lustre --ost --fsname=ioio --mgsnode=192.168.0.7@tcp0 --mkfsoptions="-J device=/dev/sdb1" --reformat /dev/md0
mount -t lustre /dev/md0 /mnt/data/ost0
rm /etc/mdadm.conf
mdadm --detail --scan --verbose > /etc/mdadm.conf
e2label /dev/md0
vi /etc/fstab
cat /proc/mdstat
shutdown -r now

When this box comes back up, the newly formatted OST should be mounted. If not, your e2label is incorrect, which does happen; the manual mentions that e2label won't report correctly until the device has been mounted for the first time.

Robert, I hope this helps speed up your testing deployment. It will probably take you two or three attempts to get a viable filesystem with all the variables in play and your naming conventions. Eventually you will end up wanting external journals as laid out above. Also, be sure to carry your directory naming conventions right through: for example, you mount the OST and its /dev/md0 device on /mnt/data/ost0; don't shorten the path, as I suspect you have on your OST mounts.
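The fstab entries referenced above would look roughly like this; the labels (which e2label reports once each target has been mounted) and the mount points are assumed from the ioio example, not copied from a real configuration:

  # on the MGS node
  LABEL=MGS                /mnt/mgs        lustre  defaults,_netdev  0 0
  # on the MDT node
  LABEL=ioio-MDT0000       /mnt/data/mdt   lustre  defaults,_netdev  0 0
  # on an OSS node
  LABEL=ioio-OST0000       /mnt/data/ost0  lustre  defaults,_netdev  0 0
  # on a client
  192.168.0.7@tcp0:/ioio   /mnt/ioio       lustre  defaults,_netdev  0 0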