Hi all,

I am having trouble getting clients connected to a few partitions of a Lustre file system.

A few days ago we had serious issues on two RAID sets. After a reboot the partitions became available again, at least locally on the file server. I tried to mount one partition after running "e2fsck" on it. The e2fsck found some issues and fixed them.

Most of the nodes have stayed connected to the partitions that experienced problems, but the nodes on which those partitions were "deactivated" are not able to rejoin the affected partitions.

On the server side I see these errors in the logs:

Jan 26 14:43:59 dot1-se-01 kernel: LustreError: 6542:0:(filter_io_26.c:684:filter_commitrw_write()) error starting transaction: rc = -30
Jan 26 14:44:08 dot1-se-01 kernel: LustreError: 6578:0:(filter_io_26.c:684:filter_commitrw_write()) error starting transaction: rc = -30
Jan 26 14:44:18 dot1-se-01 kernel: LustreError: 6485:0:(filter_io_26.c:684:filter_commitrw_write()) error starting transaction: rc = -30
Jan 26 14:44:18 dot1-se-01 kernel: LustreError: 6555:0:(filter_io_26.c:684:filter_commitrw_write()) error starting transaction: rc = -30
Jan 26 14:44:26 dot1-se-01 kernel: LustreError: 6496:0:(filter_io_26.c:684:filter_commitrw_write()) error starting transaction: rc = -30
Jan 26 14:44:28 dot1-se-01 kernel: LustreError: 6512:0:(filter_io_26.c:684:filter_commitrw_write()) error starting transaction: rc = -30

while on the client I see:

Jan 26 15:24:06 pccms35 kernel: LustreError: 11-0: an error occurred while communicating with 212.189.205.34@tcp. The ost_connect operation failed with -30
Jan 26 15:24:06 pccms35 kernel: LustreError: Skipped 77 previous similar messages
Jan 26 15:25:46 pccms35 kernel: Lustre: 9624:0:(import.c:508:import_select_connection()) lustre-OST0001-osc-ffff81019f1d3800: tried all connections, increasing latency to 36s
Jan 26 15:25:46 pccms35 kernel: Lustre: 9624:0:(import.c:508:import_select_connection()) Skipped 77 previous similar messages

"New" clients joining the cluster show the same behavior.

Any hints on this kind of issue?

Best regards,
Giacinto

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Giacinto Donvito    LIBI -- EGEE3 SA1
INFN - Bari ITALY
------------------------------------------------------------------
giacinto.donvito at ba.infn.it | GTalk/GMail: donvito.giacinto at gmail.com
tel. +39 080 5443244  Fax +39 0805442470 | Skype: giacinto_it
VOIP: +41225481596 | MSN: donvito.giacinto at hotmail.it
AIM/iChat: gdonvito1 | Yahoo: eric1_it
------------------------------------------------------------------
"At least once in a lifetime it is convenient to put everything to discussion"
Descartes
On Tue, 2010-01-26 at 15:48 +0100, Giacinto Donvito wrote:
> Jan 26 14:43:59 dot1-se-01 kernel: LustreError: 6542:0:(filter_io_26.c:684:filter_commitrw_write()) error starting transaction: rc = -30

Your target is read-only, typically because ldiskfs has found a critical error and switched the target to read-only to prevent any further damage.

You need to e2fsck -f that (and perhaps more) target(s).

b.
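The return code in these log lines is a negated errno value, and 30 is EROFS ("read-only file system"), which matches the explanation above. As a minimal sketch, the lookup can be done with Python's standard `errno` module. The helper name `decode_rc` and the regular expression are illustrative assumptions, not part of Lustre or of any tool mentioned in this thread.

```python
import errno
import os
import re

# Matches the two return-code styles seen in the log lines of this thread:
# "rc = -30" (server side) and "failed with -30" (client side).
RC_PATTERN = re.compile(r"(?:rc = |failed with )(-?\d+)")


def decode_rc(log_line):
    """Return (symbolic_name, description) for the return code in a log line.

    Returns None when the line carries no recognizable return code.
    decode_rc is a hypothetical helper written for this illustration.
    """
    match = RC_PATTERN.search(log_line)
    if match is None:
        return None
    # Kernel code reports errors as negative errno values, so drop the sign.
    code = abs(int(match.group(1)))
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


server_line = (
    "Jan 26 14:43:59 dot1-se-01 kernel: LustreError: "
    "6542:0:(filter_io_26.c:684:filter_commitrw_write()) "
    "error starting transaction: rc = -30"
)
client_line = (
    "Jan 26 15:24:06 pccms35 kernel: LustreError: 11-0: an error occurred "
    "while communicating with 212.189.205.34@tcp. "
    "The ost_connect operation failed with -30"
)

for line in (server_line, client_line):
    print(decode_rc(line))
```

Both the server and the client message decode to the same symbolic name, which suggests the client failure is a consequence of the read-only target rather than a separate network problem.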
Thanks Brian,

but I have already tested it with at least one partition:

e2fsck -fy /dev/sdb1

and the client still does not see it after a partition remount.

Could it be that I need to switch the entire server off and restart it?
What else could I check?

Cheers,
Giacinto

On 26 Jan 2010, at 18:46, Brian J. Murrell wrote:
> On Tue, 2010-01-26 at 15:48 +0100, Giacinto Donvito wrote:
>> Jan 26 14:43:59 dot1-se-01 kernel: LustreError: 6542:0:(filter_io_26.c:684:filter_commitrw_write()) error starting transaction: rc = -30
>
> Your target is read-only, typically because ldiskfs has found a critical error and switched the target to read-only to prevent any further damage.
>
> You need to e2fsck -f that (and perhaps more) target(s).
>
> b.
On Tue, 2010-01-26 at 19:47 +0100, Giacinto Donvito wrote:
> but, I have tested it already with at least one partition:
>
> e2fsck -fy /dev/sdb1

And does that run cleanly (i.e. no repairs are needed) on all of your targets? If not, you need to get to that point. If you run e2fsck -fy on a single target twice and it returns errors both times, then either there is a bug in e2fsck or the device is still corrupting.

You should also ensure you are using Sun's latest e2fsprogs release.

> but anyhow the client still do not see it after a partition remount.

You need to fix your issues on the OST(s) before you can move on to seeing what the client does.

> It could be possible that I need to switch the entire server off and restart it?

Not likely, no.

b.
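One way to tell whether a run was "clean" in the sense described above is the e2fsck exit status, which the e2fsck(8) man page documents as a sum of flag values. The sketch below only decodes a status number; it does not run e2fsck. The function names are hypothetical, and treating "exit status 0" as "clean" is an interpretation of the advice above rather than something stated in the thread.

```python
# Exit status flags as listed in the e2fsck(8) man page.
E2FSCK_FLAGS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def describe_e2fsck_status(status):
    """Split an e2fsck exit status into the conditions it encodes."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_FLAGS.items()) if status & bit]


def ran_cleanly(status):
    """True only if e2fsck found nothing to repair (exit status 0)."""
    return status == 0


# Illustrative values only: a first pass that repaired something,
# followed by a second pass that found nothing.
for status in (1, 0):
    print(status, describe_e2fsck_status(status), ran_cleanly(status))
```

Under this reading, a target is ready to be remounted only once a forced check returns 0; a status that still includes 4 or 8 on a repeated run points back at the device rather than the filesystem.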
I have run e2fsck many times on the affected partitions, and I always get the same error. For example:

########################
Error reading block 307363840 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes
########################

What could I check now?

Cheers,
Giacinto

On 26 Jan 2010, at 20:03, Brian J. Murrell wrote:
>> e2fsck -fy /dev/sdb1
On Wed, 2010-01-27 at 11:23 +0100, Giacinto Donvito wrote:
> I have executed the e2fsck many times on affected partitions, obtaining always the same error.
>
> for example:
> ########################
> Error reading block 307363840 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes
> ########################
>
> What I could check now?

Well, it's really difficult to tell without being able to dig in deeper, but it looks like either your filesystem is bigger than your device or you are having some other kind of problem reading from the physical disk. Perhaps it's time to consider that drive dead and replace it.

Once you do replace it (with a device big enough to hold the data), you could "image copy" (i.e. dd) the contents from the old device to the new one and then restart your e2fsck operation on that new device.

b.
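The "filesystem is bigger than your device" possibility can be checked with simple arithmetic: multiply the failing block number by the filesystem block size and compare the result with the size of the device. The sketch below assumes a 4096-byte block size, which the thread does not state (the real value appears in the "Block size" field of `dumpe2fs -h` output), and the device sizes used are made-up examples, not measurements from this system.

```python
ASSUMED_BLOCK_SIZE = 4096  # assumption: not stated anywhere in the thread


def block_byte_range(block_number, block_size=ASSUMED_BLOCK_SIZE):
    """Return the (start, end) byte offsets covered by a filesystem block."""
    start = block_number * block_size
    return start, start + block_size


def block_fits_on_device(block_number, device_size_bytes,
                         block_size=ASSUMED_BLOCK_SIZE):
    """True if the whole block lies inside a device of the given size."""
    _, end = block_byte_range(block_number, block_size)
    return end <= device_size_bytes


FAILED_BLOCK = 307363840  # block number from the e2fsck error quoted above
TIB = 1024 ** 4

start, end = block_byte_range(FAILED_BLOCK)
print("block starts at byte", start)

# Hypothetical device sizes, for illustration only.
for size in (1 * TIB, 2 * TIB):
    print(size, block_fits_on_device(FAILED_BLOCK, size))
```

With a 4 KiB block size the failing block starts a little above 1.14 TiB into the device, so a device (or RAID set) that now reports less than that would explain the short read. The actual size in bytes can be obtained with `blockdev --getsize64` on the device in question.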
OK, thanks.

I'm copying/moving the files out of those partitions, and I will investigate the RAID device more deeply to see whether it was a temporary problem.

Thank you for all the help.

Cheers,
Giacinto

On 27 Jan 2010, at 14:51, Brian J. Murrell wrote:
> On Wed, 2010-01-27 at 11:23 +0100, Giacinto Donvito wrote:
>> I have executed the e2fsck many times on affected partitions, obtaining always the same error.
>>
>> for example:
>> ########################
>> Error reading block 307363840 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes
>> ########################
>>
>> What I could check now?
>
> Well, it's really difficult to tell without being able to dig in deeper, but it looks like either your filesystem is bigger than your device or you are having some other kind of problem reading from the physical disk. Perhaps it's time to consider that drive dead and replace it.
>
> Once you do replace it (with a device big enough to hold the data), you could "image copy" (i.e. dd) the contents from the old device to the new one and then restart your e2fsck operation on that new device.
>
> b.