Dear all,

I've a problem after the upgrade from 1.6.4.1 to 1.6.5.

I've four OSTs making up a /fastfs lustre filesystem. On each OST I have
the following fstab:

  /dev/vg/fastfs_ost          /fastfs_ost  lustre  defaults,_netdev  0 0
  lustre-server@tcp0:/fastfs  /fastfs      lustre  _netdev,defaults  0 0

On the lustre-server I have:

  /dev/data_se/fastfs_mdt  /fastfs_mdt  lustre  defaults,_netdev  0 0

On one OST (192.168.100.101) I have the following error:

  Lustre: Client fastfs-client has started
  Lustre: Request x686 sent from fastfs-OST0000-osc-c5b6cc00 to NID 0@lo
    5s ago has timed out (limit 5s).
  Lustre: Skipped 62 previous similar messages

In fact on the others I obtain:

  Lustre: Client fastfs-client has started
  Lustre: Request x463 sent from fastfs-OST0000-osc-f7d1f800 to NID
    192.168.100.101@tcp 5s ago has timed out (limit 5s).
  Lustre: Skipped 32 previous similar messages

But on the lustre-server two OSTs seem to be dead and one is in timeout:

  Lustre: Client fastfs-client has started
  Lustre: fastfs-MDT0000: haven't heard from client
    d7fd9368-3f2b-7625-9c48-3de83b5c4cd3 (at 192.168.100.103@tcp) in 231
    seconds. I think it's dead, and I am evicting it.
  Lustre: fastfs-MDT0000: haven't heard from client
    42c0e2c4-0844-8b8b-69b2-9c16ff0ba043 (at 192.168.100.100@tcp) in 229
    seconds. I think it's dead, and I am evicting it.
  Lustre: Request x2950836 sent from fastfs-OST0000-osc to NID
    192.168.100.101@tcp 50s ago has timed out (limit 50s).
  Lustre: Skipped 65 previous similar messages

On all machines I've installed the following rpms:

  lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5smp
  kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5
  lustre-1.6.5-2.6.9_67.0.7.EL_lustre.1.6.5smp
  lustre-modules-1.6.5-2.6.9_67.0.7.EL_lustre.1.6.5smp

On each node I have the following active modules:

  lustre    644716   2
  lov       414696   3  lustre
  mdc       144900   3  lustre
  lquota    212116   3
  osc       224680   6  lustre
  ksocklnd  138984   1
  ptlrpc    970676   6  mgc,lustre,lov,mdc,lquota,osc
  obdclass  677464   9  mgc,lustre,lov,mdc,lquota,osc,ptlrpc
  lnet      267292   4  lustre,ksocklnd,ptlrpc,obdclass
  lvfs       90360   8  mgc,lustre,lov,mdc,lquota,osc,ptlrpc,obdclass
  libcfs    132044  11  mgc,lustre,lov,mdc,lquota,osc,ksocklnd,ptlrpc,obdclass,lnet,lvfs

With 1.6.4.1 everything worked fine. Where can I check to solve the problem?

Thanks

-- 
-------------------------------------------------------------------
 (o_  (o_
 //\  Cultivate Linux, since Windows crashes all by itself.
 (/)_ V_/_
+------------------------------------------------------------------+
| ENRICO MORELLI          | email: morelli at CERM.UNIFI.IT         |
| * * * *                 | phone: +39 055 4574269                 |
| University of Florence  | fax  : +39 055 4574253                 |
| CERM - via Sacconi, 6 - 50019 Sesto Fiorentino (FI) - ITALY      |
+------------------------------------------------------------------+
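Before blaming the upgrade itself, it is worth confirming that plain LNET connectivity between the nodes still works. The commands below are standard lctl usage; the NID is the OST address from the messages above:

  # on each node, list the NIDs LNET has configured
  lctl list_nids

  # from the MDS or a client, ping the OST that is timing out
  lctl ping 192.168.100.101@tcp

A request going to "NID 0@lo", as in the first message, suggests the client has not resolved a real peer NID for that target yet, so checking list_nids on that OST is a reasonable first step.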
On Mon, 7 Jul 2008 12:27:10 +0200 Enrico Morelli <morelli at cerm.unifi.it> wrote:
> Dear all,
>
> I've a problem after the upgrade from 1.6.4.1 to 1.6.5.
> [...]
> With 1.6.4.1 everything worked fine. Where can I check to solve the
> problem?

The problem solved itself. The next day I found the /fastfs lustre
filesystem mounted everywhere without problems.
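After an episode like this, a quick way to double-check that every OST really is connected again is the standard Lustre client tooling; /fastfs is the mount point from the original report:

  # list the local Lustre devices and their state
  lctl dl

  # ask the MDS and every OST to answer
  lfs check servers

  # confirm all OSTs report their capacity
  lfs df -h /fastfs

If an OST is still unreachable, lfs df usually marks it as inactive.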
> The problem solved itself. The next day I found the /fastfs lustre
> filesystem mounted everywhere without problems.

Do you know how the problem solved itself, or did you fix something?
I'm having a similar problem, but I can't tell if it's because the RAID
controller (PERC 6/E) just can't get past 3.8 TB of a 5 TB RAID5, or if
there is a problem with drives in the RAID5 (MD1000), even though
MegaCli says all is good.

Thanks,
Dale
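One thing worth ruling out first is whether the controller is even exposing the full 5 TB to the OS, or silently capping the volume near 3.8 TB. The device name below is only a placeholder for the RAID5 volume:

  # size in bytes as the kernel sees it
  blockdev --getsize64 /dev/sdb

  # the same information in a more readable form
  parted /dev/sdb unit GB print

If the reported size is already short of 5 TB, the limit is in the controller or LUN configuration rather than in Lustre.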
On Jul 08, 2008 16:32 -0700, daledude wrote:
> Do you know how the problem solved itself, or did you fix something?
> I'm having a similar problem, but I can't tell if it's because the
> RAID controller (PERC 6/E) just can't get past 3.8 TB of a 5 TB
> RAID5, or if there is a problem with drives in the RAID5 (MD1000),
> even though MegaCli says all is good.

If this is a brand-new installation (i.e. there isn't any data on the
RAID that you want to use/keep) then you could run "llverdev" on the
device to see if the device is working properly.  A "partial" (-p) run
is enough to do a quick test of the device, but if you really aren't
sure of the state of the devices then a "long" (-l) test can be useful
(though somewhat slow).

The ldiskfs filesystem in recent Lustre releases works with devices of
up to 8 TB (though not more yet), so this shouldn't be a problem for
your 5 TB device.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
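A minimal usage sketch of the llverdev runs described above; /dev/sdb is a placeholder for the OST device, and note that llverdev writes test patterns to the device, so it must not hold data you want to keep:

  # quick partial check of the device
  llverdev -p /dev/sdb

  # full write/read verification of every block (slow)
  llverdev -l /dev/sdb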
On Jul 9, 2:19 am, Andreas Dilger <adil... at sun.com> wrote:
> If this is a brand-new installation (i.e. there isn't any data on the
> RAID that you want to use/keep) then you could run "llverdev" on the
> device to see if the device is working properly.  A "partial" (-p) run
> is enough to do a quick test of the device, but if you really aren't
> sure of the state of the devices then a "long" (-l) test can be useful
> (though somewhat slow).
>
> The ldiskfs filesystem in recent Lustre releases works with devices of
> up to 8 TB (though not more yet), so this shouldn't be a problem for
> your 5 TB device.
>
> Cheers, Andreas

Thanks Andreas for the llverdev tip. That will definitely be useful.

I ended up lowering the maximum number of ll_ost_io threads that can
run, from 128 down to 15. I was getting the timeouts and disconnects
after I removed another OSS that had two OSTs, so it seems that put
more of the load on the single OSS left.
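For reference, one way the OSS I/O thread count is usually capped on Lustre 1.6 servers is through a module parameter; the parameter name below (oss_num_threads on the ost module) is quoted from memory, so check your release's documentation before relying on it:

  # /etc/modprobe.conf on each OSS, read when the ost module loads
  options ost oss_num_threads=15

The OSS modules have to be reloaded (or the node rebooted) before the new thread count takes effect.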