Hi Guys,

One of my clients got a hung Lustre mount this morning and I saw the following errors in my logs:

--
..snip..
Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous similar messages
Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous similar messages
Feb 15 10:16:54 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261945 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
Feb 15 10:16:54 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection to service reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress operations using this service will wait for recovery to complete.
Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous similar messages
Feb 15 10:16:55 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261947 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous similar messages
Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous similar messages
Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous similar messages
Feb 15 10:31:43 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection restored to service reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
--

Due to disk space issues on my Lustre filesystem, one of the OSTs was full, so I deactivated that OST this morning. I thought that operation just puts it in a read-only state and that clients can still access the data on that OST. After I activated the OST again, the client reconnected and was fine afterwards. How else would you deal with an OST that is close to 100% full? Is it okay to leave the OST active, and will the clients know not to write data to that OST?

Thanks,
-J
You can deactivate it on the MDT; that will make it read-only, but leave it alone on the clients so they can still access files from it.

bob

On 2/15/2011 1:57 PM, Jagga Soorma wrote:
> Hi Guys,
>
> One of my clients got a hung lustre mount this morning and I saw the
> following errors in my logs:
> ..snip..
>
> Due to disk space issues on my lustre filesystem one of the OSTs was
> full and I deactivated that OST this morning. I thought that
> operation just puts it in a read only state and that clients can still
> access the data from that OST. After activating this OST again the
> client connected again and was okay after this. How else would you
> deal with a OST that is close to 100% full? Is it okay to leave the
> OST active and the clients will know not to write data to that OST?
>
> Thanks,
> -J
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
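[A minimal sketch of the MDS-side deactivation described above. The `lctl dl` sample line is illustrative only, not captured from the poster's system; on a real MDS you would pipe `lctl dl` itself, and the device index will differ.]

```shell
# Find the MDS-side OSC device index for the full OST, then deactivate it
# there so the MDS stops allocating new objects on it while clients can
# still read existing data. A sample line stands in for real `lctl dl`
# output on the MDS.
lctl_dl_sample='  7 UP osc reshpcfs-OST0005-osc reshpcfs-mdtlov_UUID 5'
dev=$(echo "$lctl_dl_sample" | awk '/reshpcfs-OST0005-osc/ {print $1}')
echo "device index: $dev"
# On the MDS (not executed here):
#   lctl --device "$dev" deactivate
# Later, once space has been freed, re-enable allocations:
#   lctl --device "$dev" activate
```

Because this is done only on the MDS, client-side OSC devices stay active and reads continue to work, which matches the behavior described in this thread.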
The client situation depends on where you deactivated the OST: if you deactivate it on the MDS only, clients should still be able to read.

What is best to do when an OST fills up really depends on what else you are doing at the time, and how much control you have over what the clients are doing, among other things. If you can solve the space issue with a quick rm -rf, it is best to leave the OST online; likewise, if all your clients are trying to bang on it and failing, it is best to turn things off. YMMV.

cliffw

On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma <jagga13 at gmail.com> wrote:
> Hi Guys,
>
> One of my clients got a hung lustre mount this morning and I saw the
> following errors in my logs:
> ..snip..
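[To spot which OST needs attention before clients start hanging, you can rank OSTs by fullness from `lfs df` output. A sketch; the sample lines below are illustrative stand-ins for live output, with figures echoing this thread.]

```shell
# Sort OSTs by usage percentage, fullest first. On a live client you
# would pipe `lfs df` itself instead of the sample text.
lfs_df_sample='reshpcfs-OST0005_UUID 2.0T 1.7T 224.0G 89% /reshpcfs[OST:5]
reshpcfs-OST0007_UUID 2.0T 1.9T 40.0G 98% /reshpcfs[OST:7]'
echo "$lfs_df_sample" | awk '{gsub(/%/, "", $5); print $5, $1}' | sort -rn
# On a live client:
#   lfs df | grep OST | awk '{gsub(/%/, "", $5); print $5, $1}' | sort -rn | head
```

This prints `98 reshpcfs-OST0007_UUID` before `89 reshpcfs-OST0005_UUID`, so the fullest target is always the first line.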
On 2011-02-15, at 12:20, Cliff White wrote:
> Client situation depends on where you deactivated the OST - if you
> deactivate on the MDS only, clients should be able to read.
>
> What is best to do when an OST fills up really depends on what else
> you are doing at the time, and how much control you have over what the
> clients are doing and other things. If you can solve the space issue
> with a quick rm -rf, best to leave it online, likewise if all your
> clients are trying to bang on it and failing, best to turn things off.
> YMMV

In theory, with 1.8 the full OST should be skipped for new object allocations, but this is not robust in the face of, e.g., a single very large file being written to the OST that takes it from "average" usage to being full.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
I did deactivate this OST on the MDS server. So how would I deal with an OST filling up? The OSTs don't seem to be filling up evenly either. How does Lustre handle an OST that is at 100%? Would it not use that specific OST for writes if there are other OSTs available with capacity?

Thanks,
-J

On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <adilger at whamcloud.com> wrote:
> In theory, with 1.8 the full OST should be skipped for new object
> allocations, but this is not robust in the face of e.g. a single very
> large file being written to the OST that takes it from "average" usage
> to being full.
> ..snip..
Also, it looks like the client is reporting a different %used compared to the OSS server itself:

client:
reshpc101:~ # lfs df -h | grep -i 0007
reshpcfs-OST0007_UUID 2.0T 1.7T 202.7G 84% /reshpcfs[OST:7]

oss:
/dev/mapper/mpath7 2.0T 1.9T 40G 98% /gnet/lustre/oss02/mpath7

Here is how the data seems to be distributed on one of the OSSs:
--
/dev/mapper/mpath5 2.0T 1.2T 688G 65% /gnet/lustre/oss02/mpath5
/dev/mapper/mpath6 2.0T 1.7T 224G 89% /gnet/lustre/oss02/mpath6
/dev/mapper/mpath7 2.0T 1.9T 41G 98% /gnet/lustre/oss02/mpath7
/dev/mapper/mpath8 2.0T 1.3T 671G 65% /gnet/lustre/oss02/mpath8
/dev/mapper/mpath9 2.0T 1.3T 634G 67% /gnet/lustre/oss02/mpath9
--

-J

On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
> I did deactivate this OST on the MDS server. So how would I deal with
> an OST filling up?
> ..snip..
I might be looking at the wrong OST. What is the best way to map the actual /dev/mapper/mpath[X] device to the OST ID used for that volume?

Thanks,
-J

On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
> Also, it looks like the client is reporting a different %used compared
> to the oss server itself:
> ..snip..
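[On the device-to-OST mapping question: for ldiskfs-backed OSTs the target name is stored in the volume label, so it can be read off each backing device. A sketch, run on the OSS; the label parsing below uses a sample string since no real device is available here. Note the OST index in the label is hexadecimal.]

```shell
# On the OSS (not executed here), read each backing device's target label:
#   for d in /dev/mapper/mpath5 /dev/mapper/mpath6 ...; do
#       echo "$d -> $(e2label $d)"
#   done
# `tunefs.lustre --dryrun /dev/mapper/mpath7` prints the same Target:
# name plus more detail.
#
# The label encodes the OST index in hex; parse a sample label:
label='reshpcfs-OST0007'
idx=$(printf '%d' "0x${label##*OST}")
echo "$label is OST index $idx"
```

So a device labeled `reshpcfs-OST000a`, for example, would be OST index 10 in `lfs df` output, which is why the mpath number and the OST index need not line up.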
This OST is 100% full now with only 12GB remaining, and something is actively writing to this volume. What would be the appropriate thing to do in this scenario? If I set it to read-only on the MDS then some of my clients start hanging.

Should I be running "lfs find -O OST_UID /lustre" and then moving the files out of this filesystem and re-adding them? But then there is no guarantee that they will not be written to this specific OST again.

Any help would be greatly appreciated.

Thanks,
-J

On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
> I might be looking at the wrong OST. What is the best way to map the
> actual /dev/mapper/mpath[X] to the OST ID used for that volume?
>
> Thanks,
> -J
>
> On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>> Also, it looks like the client is reporting a different %used compared to
>> the OSS server itself:
>>
>> client:
>> reshpc101:~ # lfs df -h | grep -i 0007
>> reshpcfs-OST0007_UUID  2.0T  1.7T  202.7G  84%  /reshpcfs[OST:7]
>>
>> oss:
>> /dev/mapper/mpath7  2.0T  1.9T  40G  98%  /gnet/lustre/oss02/mpath7
>>
>> Here is how the data seems to be distributed on one of the OSSs:
>> --
>> /dev/mapper/mpath5  2.0T  1.2T  688G  65%  /gnet/lustre/oss02/mpath5
>> /dev/mapper/mpath6  2.0T  1.7T  224G  89%  /gnet/lustre/oss02/mpath6
>> /dev/mapper/mpath7  2.0T  1.9T   41G  98%  /gnet/lustre/oss02/mpath7
>> /dev/mapper/mpath8  2.0T  1.3T  671G  65%  /gnet/lustre/oss02/mpath8
>> /dev/mapper/mpath9  2.0T  1.3T  634G  67%  /gnet/lustre/oss02/mpath9
>> --
>>
>> -J
>>
>> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>> I did deactivate this OST on the MDS server. So how would I deal with an
>>> OST filling up? The OSTs don't seem to be filling up evenly either. How
>>> does Lustre handle an OST that is at 100%? Would it not use this specific
>>> OST for writes if there are other OSTs available with capacity?
>>>
>>> Thanks,
>>> -J
>>>
>>> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <adilger at whamcloud.com> wrote:
>>>> On 2011-02-15, at 12:20, Cliff White wrote:
>>>> > Client situation depends on where you deactivated the OST - if you
>>>> > deactivate on the MDS only, clients should be able to read.
>>>> >
>>>> > What is best to do when an OST fills up really depends on what else
>>>> > you are doing at the time, and how much control you have over what the
>>>> > clients are doing and other things. If you can solve the space issue
>>>> > with a quick rm -rf, best to leave it online; likewise if all your
>>>> > clients are trying to bang on it and failing, best to turn things
>>>> > off. YMMV
>>>>
>>>> In theory, with 1.8 the full OST should be skipped for new object
>>>> allocations, but this is not robust in the face of e.g. a single very
>>>> large file being written to the OST that takes it from "average" usage
>>>> to being full.
>>>>
>>>> ..snip..
You can use lfs find or lfs getstripe to identify where files are. If you move the files out and move them back, the QOS policy should redistribute them evenly, but it very much depends. If you have clients using a stripe count of 1, a single large file can fill up one OST.

df on the client reports space for the entire filesystem; df on the OSS reports space for the targets attached to that server, so yes, the results will be different.

cliffw

On Tue, Feb 15, 2011 at 4:09 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
> This OST is 100% full now with only 12GB remaining and something is
> actively writing to this volume. What would be the appropriate thing to do
> in this scenario? If I set it to read-only on the MDS then some of my
> clients start hanging.
>
> Should I be running "lfs find -O OST_UID /lustre" and then moving the
> files out of this filesystem and re-adding them? But then there is no
> guarantee that they will not be written to this specific OST again.
>
> ..snip..
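A sketch of the find-and-rewrite approach discussed above. The OST UUID is taken from this thread's naming and the `migrate_copy` helper is a hypothetical name, not a Lustre tool. The key point is that a copy allocates fresh objects (on emptier OSTs, since the allocator avoids full ones), while a bare `mv` within the filesystem only changes the directory entry:

```shell
# Hedged sketch: free space on a full OST by rewriting files elsewhere.
# Copying re-allocates objects; renaming over the original then frees the
# old objects on the full OST.
migrate_copy() {   # usage: migrate_copy <file>
    cp "$1" "$1.migrate.tmp" && mv "$1.migrate.tmp" "$1"
}

# On a real Lustre mount you would first list files with objects on the
# full OST, e.g.:
#   lfs find /reshpcfs --obd reshpcfs-OST0007_UUID -type f
# and then run migrate_copy on the largest ones.

# Demonstration on an ordinary file so the helper can be exercised anywhere:
tmp=$(mktemp -d)
echo "payload" > "$tmp/bigfile"
migrate_copy "$tmp/bigfile"
cat "$tmp/bigfile"
rm -r "$tmp"
```

On the stripe-count point: directories holding very large files can be given a wider default stripe (e.g. `lfs setstripe -c -1 <dir>` stripes across all OSTs), which reduces the chance of one file filling a single OST.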
Another thing that I just noticed is that after deactivating an OST on the MDS, I am no longer able to check the quotas for users. Here is the message I receive:

--
Disk quotas for user testuser (uid 17229):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
        /lustre     [0]     [0]     [0]             [0]     [0]     [0]
Some errors happened when getting quota info. Some devices may be not
working or deactivated. The data in "[]" is inaccurate.
--

Is this normal and expected? Or am I missing something here?

Thanks for all your support. It is much appreciated.

Regards,
-J

On Tue, Feb 15, 2011 at 4:25 PM, Cliff White <cliffw at whamcloud.com> wrote:
> You can use lfs find or lfs getstripe to identify where files are. If you
> move the files out and move them back, the QOS policy should redistribute
> them evenly, but it very much depends. If you have clients using a stripe
> count of 1, a single large file can fill up one OST.
>
> df on the client reports space for the entire filesystem; df on the OSS
> reports space for the targets attached to that server, so yes, the results
> will be different.
>
> cliffw
>
> ..snip..
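For reference, the deactivate/activate toggling discussed in this thread is done on the MDS with lctl; the device number below is hypothetical and must be read from the local `lctl dl` output first. While the OSC device is deactivated, the MDS cannot aggregate that OST's quota usage, which is consistent with the bracketed zeros shown above:

```shell
# On the MDS: find the OSC device for the full OST, then toggle it.
# The device number (11 here) is an example; use the one lctl dl prints.
lctl dl | grep OST0007          # leading column is the device number
lctl --device 11 deactivate     # MDS stops allocating new objects there
# ... later, once space has been freed on the OST:
lctl --device 11 activate       # allocations and quota aggregation resume
```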
To figure out which OST is which, use "e2label /dev/sdX" (or "e2label /dev/mapper/mpath7"), which will print the OST index in hex.

If clients run out of space while there is still space left elsewhere, see Bug 22755 (mostly fixed in Lustre 1.8.4). Lustre assigns the OST index at file creation time; it will avoid full OSTs for new files, but once a file is created any growth must be accommodated by the initial OST assignment(s).

Deactivating the OST on the MDS will prevent new allocations, but they shouldn't be happening anyway. You can copy some large files (then rename them over the originals) to put them on another OST, which will free up space on the full OST (a plain move will not allocate new space, just change the directory entry).

Kevin

Jagga Soorma wrote:
> This OST is 100% full now with only 12GB remaining and something is
> actively writing to this volume. What would be the appropriate thing to do
> in this scenario? If I set it to read-only on the MDS then some of my
> clients start hanging.
>
> Should I be running "lfs find -O OST_UID /lustre" and then moving the
> files out of this filesystem and re-adding them? But then there is no
> guarantee that they will not be written to this specific OST again.
>
> ..snip..
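Kevin's e2label tip can be scripted. Note that the index in the label is hexadecimal, so a label ending in OST000a corresponds to OST index 10 in `lfs df` output. A sketch using the device names from this thread (it simply skips devices that do not exist or have no label):

```shell
# Map each multipath device to its OST index. e2label prints the target
# label, e.g. "reshpcfs-OST0007"; the digits after "OST" are hex.
for dev in /dev/mapper/mpath*; do
    [ -e "$dev" ] || continue
    label=$(e2label "$dev" 2>/dev/null) || continue   # e.g. reshpcfs-OST0007
    hex=${label##*OST}                                # e.g. 0007 (hex)
    idx=$(printf '%d' "0x$hex")                       # 0007 -> 7, 000a -> 10
    printf '%s -> %s (OST index %s)\n' "$dev" "$label" "$idx"
done
```

The decimal index is what `lfs df` shows as `[OST:N]`, so this is enough to tie a 98%-full /dev/mapper/mpathX back to the right OST UUID.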