Dear all,

I've a problem after the upgrade from 1.6.4.1 to 1.6.5.

I've four OSTs making up a /fastfs lustre filesystem. On each OST I have
the following fstab:

  /dev/vg/fastfs_ost          /fastfs_ost  lustre  defaults,_netdev  0 0
  lustre-server@tcp0:/fastfs  /fastfs      lustre  _netdev,defaults  0 0

On the lustre-server I have:

  /dev/data_se/fastfs_mdt  /fastfs_mdt  lustre  defaults,_netdev  0 0

On one OST (192.168.100.101) I have the following error:

  Lustre: Client fastfs-client has started
  Lustre: Request x686 sent from fastfs-OST0000-osc-c5b6cc00 to NID 0@lo
    5s ago has timed out (limit 5s).
  Lustre: Skipped 62 previous similar messages

In fact on the others I obtain:

  Lustre: Client fastfs-client has started
  Lustre: Request x463 sent from fastfs-OST0000-osc-f7d1f800 to NID
    192.168.100.101@tcp 5s ago has timed out (limit 5s).
  Lustre: Skipped 32 previous similar messages

But on the lustre-server two OSTs seem to be dead and one is in timeout:

  Lustre: Client fastfs-client has started
  Lustre: fastfs-MDT0000: haven't heard from client
    d7fd9368-3f2b-7625-9c48-3de83b5c4cd3 (at 192.168.100.103@tcp) in 231
    seconds. I think it's dead, and I am evicting it.
  Lustre: fastfs-MDT0000: haven't heard from client
    42c0e2c4-0844-8b8b-69b2-9c16ff0ba043 (at 192.168.100.100@tcp) in 229
    seconds. I think it's dead, and I am evicting it.
  Lustre: Request x2950836 sent from fastfs-OST0000-osc to NID
    192.168.100.101@tcp 50s ago has timed out (limit 50s).
  Lustre: Skipped 65 previous similar messages

On all machines I've installed the following rpms:

  lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5smp
  kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5
  lustre-1.6.5-2.6.9_67.0.7.EL_lustre.1.6.5smp
  lustre-modules-1.6.5-2.6.9_67.0.7.EL_lustre.1.6.5smp

On each node I have the following active modules:

  lustre    644716   2
  lov       414696   3  lustre
  mdc       144900   3  lustre
  lquota    212116   3
  osc       224680   6  lustre
  ksocklnd  138984   1
  ptlrpc    970676   6  mgc,lustre,lov,mdc,lquota,osc
  obdclass  677464   9  mgc,lustre,lov,mdc,lquota,osc,ptlrpc
  lnet      267292   4  lustre,ksocklnd,ptlrpc,obdclass
  lvfs       90360   8  mgc,lustre,lov,mdc,lquota,osc,ptlrpc,obdclass
  libcfs    132044  11  mgc,lustre,lov,mdc,lquota,osc,ksocklnd,ptlrpc,obdclass,lnet,lvfs

With 1.6.4.1 everything worked fine. Where can I check to solve the problem?

Thanks

-- 
-------------------------------------------------------------------
 (o_  (o_
 //\  Cultivate Linux, since Windows crashes all by itself.
 (/)_ V_/_
+------------------------------------------------------------------+
| ENRICO MORELLI          | email: morelli at CERM.UNIFI.IT         |
| * * * *                 | phone: +39 055 4574269                 |
| University of Florence  | fax  : +39 055 4574253                 |
| CERM - via Sacconi, 6 - 50019 Sesto Fiorentino (FI) - ITALY      |
+------------------------------------------------------------------+
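Before blaming the upgrade itself, it is worth confirming that plain LNET connectivity between the nodes still works. The commands below are standard lctl usage; the NID is the OST address from the messages above:

  # on each node, list the NIDs LNET has configured
  lctl list_nids

  # from the MDS or a client, ping the OST that is timing out
  lctl ping 192.168.100.101@tcp

A request going to "NID 0@lo", as in the first message, suggests the client has not resolved a real peer NID for that target yet, so checking list_nids on that OST is a reasonable first step.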
On Mon, 7 Jul 2008 12:27:10 +0200 Enrico Morelli <morelli at cerm.unifi.it> wrote:
> Dear all,
>
> I've a problem after the upgrade from 1.6.4.1 to 1.6.5.
> [...]
> With 1.6.4.1 everything worked fine. Where can I check to solve the
> problem?

The problem solved itself. The next day I found the /fastfs lustre
filesystem mounted everywhere without problems.
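After an episode like this, a quick way to double-check that every OST really is connected again is the standard Lustre client tooling; /fastfs is the mount point from the original report:

  # list the local Lustre devices and their state
  lctl dl

  # ask the MDS and every OST to answer
  lfs check servers

  # confirm all OSTs report their capacity
  lfs df -h /fastfs

If an OST is still unreachable, lfs df usually marks it as inactive.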
> The problem solved itself. The next day I found the /fastfs lustre
> filesystem mounted everywhere without problems.

Do you know how the problem solved itself, or did you fix something?
I'm having a similar problem, but I can't tell if it's because the RAID
controller (PERC 6/E) just can't get past 3.8 TB of a 5 TB RAID5, or if
there is a problem with drives in the RAID5 (MD1000), even though
MegaCli says all is good.

Thanks,
Dale
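One thing worth ruling out first is whether the controller is even exposing the full 5 TB to the OS, or silently capping the volume near 3.8 TB. The device name below is only a placeholder for the RAID5 volume:

  # size in bytes as the kernel sees it
  blockdev --getsize64 /dev/sdb

  # the same information in a more readable form
  parted /dev/sdb unit GB print

If the reported size is already short of 5 TB, the limit is in the controller or LUN configuration rather than in Lustre.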
On Jul 08, 2008 16:32 -0700, daledude wrote:
> Do you know how the problem solved itself, or did you fix something?
> I'm having a similar problem, but I can't tell if it's because the
> RAID controller (PERC 6/E) just can't get past 3.8 TB of a 5 TB
> RAID5, or if there is a problem with drives in the RAID5 (MD1000),
> even though MegaCli says all is good.

If this is a brand-new installation (i.e. there isn't any data on the
RAID that you want to use/keep) then you could run "llverdev" on the
device to see if the device is working properly.  A "partial" (-p) run
is enough to do a quick test of the device, but if you really aren't
sure of the state of the devices then a "long" (-l) test can be useful
(though somewhat slow).

The ldiskfs filesystem in recent Lustre releases works with devices of
up to 8 TB (though not more yet), so this shouldn't be a problem for
your 5 TB device.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
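A minimal usage sketch of the llverdev runs described above; /dev/sdb is a placeholder for the OST device, and note that llverdev writes test patterns to the device, so it must not hold data you want to keep:

  # quick partial check of the device
  llverdev -p /dev/sdb

  # full write/read verification of every block (slow)
  llverdev -l /dev/sdb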
On Jul 9, 2:19 am, Andreas Dilger <adil... at sun.com> wrote:
> If this is a brand-new installation (i.e. there isn't any data on the
> RAID that you want to use/keep) then you could run "llverdev" on the
> device to see if the device is working properly.  A "partial" (-p) run
> is enough to do a quick test of the device, but if you really aren't
> sure of the state of the devices then a "long" (-l) test can be useful
> (though somewhat slow).
>
> The ldiskfs filesystem in recent Lustre releases works with devices of
> up to 8 TB (though not more yet), so this shouldn't be a problem for
> your 5 TB device.
>
> Cheers, Andreas

Thanks Andreas for the llverdev tip. That will definitely be useful.

I ended up lowering the maximum number of ll_ost_io threads that can
run, from 128 down to 15. I was getting the timeouts and disconnects
after I removed another OSS that had two OSTs, so it seems that put
more of the load on the single OSS left.
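For reference, one way the OSS I/O thread count is usually capped on Lustre 1.6 servers is through a module parameter; the parameter name below (oss_num_threads on the ost module) is quoted from memory, so check your release's documentation before relying on it:

  # /etc/modprobe.conf on each OSS, read when the ost module loads
  options ost oss_num_threads=15

The OSS modules have to be reloaded (or the node rebooted) before the new thread count takes effect.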