I have a Lustre (1.6.7) system that looks OKish (as far as I can see) from the mds and most of the clients. From one client, however (the users' login machine), it looks broken. Some files are missing, some seem broken, and the df command hangs.

Rebooting the client doesn't change anything. Is it broken, or is there some persistent information that I need to flush? When I do an ls on a partially broken directory, I get the following two lines in /var/log/messages:

Nov 18 12:13:53 mhdc kernel: [ 7093.751196] LustreError: 10919:0:(file.c:999:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Nov 18 12:13:53 mhdc kernel: [ 7093.761098] LustreError: 10919:0:(file.c:999:ll_glimpse_size()) Skipped 9 previous similar messages

Any ideas how to proceed with the least disruption?

Thanks in advance,

Herbert
--
Herbert Fruchtl
Senior Scientific Computing Officer
School of Chemistry, School of Mathematics and Statistics
University of St Andrews
--
The University of St Andrews is a charity registered in Scotland: No SC013532
Could you elaborate on how "broken" the files are? From your description and the error message you provided, I suspect that one (or some) of the OSTs went down. What does `lctl dl` show?
I was wrong about only one client having problems. It seems to be all of them, except the mds server (see below), so it is a problem of the filesystem (not the client) after all.

> Could you elaborate about how "broken" the files are?

When I do an 'ls', the filenames are flashing in red (as happens, for example, with broken symbolic links). Permissions, date and owner are missing, as in the middle of the next three lines:

-rw------- 1 root root 18308319 Jul 16 2009 stat_1247756353.gz
?--------- ? ? ? ? ? stat_1248125742.gz
drwxr-xr-x 2 stephane ukmhd 4096 Jul 8 2009 stephane

Attempting to access the file more closely results in an I/O error:

[root@mhdc ~]# ls -l /workspace/ls-lR_2009-01-20
ls: /workspace/ls-lR_2009-01-20: Input/output error
[root@mhdc ~]# cp /workspace/ls-lR_2009-01-20 /tmp
cp: cannot stat `/workspace/ls-lR_2009-01-20': Input/output error

> From your description and the error message you provide, I suspect that one(or some) of the OSTs went down. What does `lctl dl` show?

The files are accessible from the mds server, and the OSTs seem visible from the "broken" clients:

[root@mhdc ~]# lctl dl
0 UP mgc MGC192.168.101.214@tcp 63568484-f714-da05-c5c2-b96db1b22962 5
1 UP lov home-clilov-ffff8100d7ecf000 651d7044-988f-f324-6896-3e09edf8a90b 4
2 UP mdc home-MDT0000-mdc-ffff8100d7ecf000 651d7044-988f-f324-6896-3e09edf8a90b 5
3 UP osc home-OST0001-osc-ffff8100d7ecf000 651d7044-988f-f324-6896-3e09edf8a90b 5
4 UP osc home-OST0003-osc-ffff8100d7ecf000 651d7044-988f-f324-6896-3e09edf8a90b 5
5 UP osc home-OST0002-osc-ffff8100d7ecf000 651d7044-988f-f324-6896-3e09edf8a90b 5
6 UP osc home-OST0005-osc-ffff8100d7ecf000 651d7044-988f-f324-6896-3e09edf8a90b 5
7 UP osc home-OST0004-osc-ffff8100d7ecf000 651d7044-988f-f324-6896-3e09edf8a90b 5
8 UP osc home-OST0000-osc-ffff8100d7ecf000 651d7044-988f-f324-6896-3e09edf8a90b 5

Does this help?

Herbert
Hello,

On 2010-11-18, at 10:03, Herbert Fruchtl wrote:
> Attempting to access the file more closely results in an I/O error:
> [root@mhdc ~]# ls -l /workspace/ls-lR_2009-01-20
> ls: /workspace/ls-lR_2009-01-20: Input/output error
> [root@mhdc ~]# cp /workspace/ls-lR_2009-01-20 /tmp
> cp: cannot stat `/workspace/ls-lR_2009-01-20': Input/output error

This looks very much like some OSTs are failing.
> The files are accessible from the mds server, and the OSTs seem
> visible from the "broken" clients:
> [root@mhdc ~]# lctl dl
> [...]
> Does this help?

I meant the `lctl dl` output on the OSS servers. Make sure that your OSTs are all mounted and running well.
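As a rough aid for that check, the device listing can be filtered so that only entries in a state other than UP are printed. This is a minimal sketch, not part of the original advice: the function name `not_up_devices` is made up for illustration, and it assumes the column layout shown in the client listing earlier in the thread (index, state, type, name, UUID, refcount).

```shell
#!/bin/sh
# Hypothetical helper: print every device from `lctl dl`-style output
# whose state column is anything other than UP.
# It reads the listing on stdin, so it can be tried on saved output too.
not_up_devices() {
    # Assumed columns: index, state, type, name, uuid, refcount
    awk 'NF >= 4 && $2 != "UP" { print $4 " (" $3 ") is " $2 }'
}

# Typical use on an OSS (assumes lctl is in PATH and you are root):
#   lctl dl | not_up_devices
```

No output means every listed device reports UP. That alone does not prove the OSTs are mounted, so it is still worth comparing the list against the OSTs you expect each OSS to serve.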
Wang Yibin wrote:
> On 2010-11-18, at 10:03, Herbert Fruchtl wrote:
>> I was wrong about only one client having problems. It seems to
>> be all of them, except the mds server (see below), so it is a
>> problem of the filesystem (not the client) after all.

It looks like you may have corruption on the mdt or an ost, where the objects on an OST can't be found for the directory entry. Have you had a crash recently or run Lustre fsck? You might need to do fsck and delete (unlink) the "broken" files.

I suppose it's also possible you're seeing fallout from an earlier LBUG or something. Try 'cat /proc/fs/lustre/health_check' on all the servers.

Kevin
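To run the health check across several servers in one go, something like the following could be used. It is only a sketch: the function name `check_health` and the host names in the usage comment are placeholders, and it assumes passwordless ssh as root and that a healthy server reports the single word "healthy".

```shell
#!/bin/sh
# Hypothetical helper: summarise one server's health_check output.
# $1 is a label for the host; the file contents arrive on stdin.
check_health() {
    status=$(head -n 1)
    if [ "$status" = "healthy" ]; then
        echo "$1: ok"
    else
        # Anything else (or nothing at all) deserves a closer look.
        echo "$1: PROBLEM (${status:-no output})"
    fi
}

# Usage sketch; mds1, oss1 and oss2 are placeholder host names:
#   for h in mds1 oss1 oss2; do
#       ssh "$h" cat /proc/fs/lustre/health_check | check_health "$h"
#   done
```

A server that cannot be reached shows up as "no output", which is deliberately treated as a problem rather than silently skipped.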
Hello!

On Nov 18, 2010, at 7:18 AM, Herbert Fruchtl wrote:
> Rebooting the client doesn't change anything. Is it broken, or is there some
> persistent information that I need to flush? When I do an ls on a partially
> broken directory, I get the following two lines in /var/log/messages:
>
> Nov 18 12:13:53 mhdc kernel: [ 7093.751196] LustreError:
> 10919:0:(file.c:999:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO

You might want to check /var/log/messages (or dmesg) on your OSTs to see what the corresponding complaints from the OSTs were at the same time. That would give us a better idea of what the OSTs are actually unhappy about.

But so far it indeed looks like you have files referencing some non-existent objects (the messages on the OSTs would look like lvb_init errors).

Bye,
Oleg
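One way to pull the matching server-side messages out of a syslog file is to filter on the client's timestamp. The sketch below is an illustration only: the function name `lustre_errors_at` is invented, and it assumes the classic syslog line format seen in the client log above and reasonably synchronised clocks between client and servers.

```shell
#!/bin/sh
# Hypothetical helper: print LustreError lines whose syslog timestamp
# starts with the given prefix, e.g. "Nov 18 12:13".
# The log text is read from stdin.
lustre_errors_at() {
    awk -v ts="$1" 'index($0, ts) == 1 && /LustreError/'
}

# Usage sketch on an OSS, matching the minute of the client-side error:
#   lustre_errors_at 'Nov 18 12:13' < /var/log/messages
```

Widening the prefix (for example to 'Nov 18 12:1') catches messages from a few minutes either side if the clocks drift slightly.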