We have been experiencing problems recently where our Lustre filesystem becomes read-only (we can't even see our data).

For example, when I invoke 'ls' or 'find':

ls: .: Read-only file system
find: cannot get current directory: Read-only file system

The client version is:

lustre: 1.6.6
kernel: patchless
build: 1.6.6-19691231190000-PRISTINE-.usr.src.linux-2.6.18-92.1.10.el5

The MDS/OST version is:

lustre: 1.6.4.3
kernel: patchless

On the MDS, lctl dl shows:

-bash-3.1$ lctl dl
  0 UP mgs MGS MGS 23
  1 UP mgc MGC141.128.90.153@tcp caf62255-4f22-53a3-25d9-c7f9e6c31277 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov lfs002-mdtlov lfs002-mdtlov_UUID 4
  4 UP mds lfs002-MDT0000 lfs002-MDT0000_UUID 21
  5 UP osc lfs002-OST0000-osc lfs002-mdtlov_UUID 5
  6 UP osc lfs002-OST0001-osc lfs002-mdtlov_UUID 5
  7 UP ost OSS OSS_uuid 3
  8 UP obdfilter lfs002-OST0001 lfs002-OST0001_UUID 23

It is interesting because we have another Lustre filesystem where we can see things and everything is working properly. Is there something I am missing?

TIA
First time ever:

# cat health_check
device lfs002-MDT0000 reported unhealthy
NOT HEALTHY

I have never seen this.

On Sat, Mar 21, 2009 at 5:26 AM, Mag Gam <magawake at gmail.com> wrote:
> We have been experiencing problems recently, where our Lustre
> filesystem is becoming read-only (we can't even see our data).
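The health check output above names the device that is in trouble. As a rough sketch, the device names can be pulled out of that output with a small filter. The function name unhealthy_devices is made up for illustration, and the path /proc/fs/lustre/health_check is an assumption about where the file lives on a 1.6-era server (the message only shows "cat health_check").

```shell
#!/bin/sh
# Sketch: list the devices that a Lustre health check reports as unhealthy.
# Reads health_check text on stdin, so it also works on saved output.
# On a live server the input would presumably be
# /proc/fs/lustre/health_check (assumed path, not shown in the thread).
unhealthy_devices() {
    # Matching lines look like: "device lfs002-MDT0000 reported unhealthy"
    awk '$1 == "device" && $3 == "reported" && $4 == "unhealthy" { print $2 }'
}

# Example using the output quoted in this thread:
printf 'device lfs002-MDT0000 reported unhealthy\nNOT HEALTHY\n' | unhealthy_devices
```

A healthy server prints nothing through this filter, so an empty result can be used as the "all clear" case in a monitoring script.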
On Sat, 2009-03-21 at 05:26 -0700, Mag Gam wrote:
> We have been experiencing problems recently, where our Lustre
> filesystem is becoming read-only (we can't even see our data).
>
> For example when I invoke 'ls' or 'find'
>
> ls: .: Read-only file system
> find: cannot get current directory: Read-only file system

This usually happens because some error on the servers causes Lustre to turn the media read-only in an attempt to prevent corruption. The underlying cause is usually hardware related. Check the Lustre servers to see which device has been made read-only and then check your storage to see what situations might have caused it.

b.
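One way to start on "which device has been made read-only" is to look at the mount table on each server. The sketch below filters text in /proc/mounts format for entries whose options include "ro". The function name ro_mounts and the sample device and mount point names are hypothetical. Whether an error-triggered read-only state always shows up in the mount options is an assumption here, so the kernel log (dmesg) on the server is worth checking as well.

```shell
#!/bin/sh
# Sketch: print mounts whose option list contains "ro".
# Input is expected in /proc/mounts format:
#   device mountpoint fstype options dump pass
ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $1, $2, $3; break }
    }'
}

# Example with made-up entries (device and mount point names are
# hypothetical, not taken from the thread):
printf '%s\n' \
    '/dev/sdb1 /mnt/mdt lustre ro 0 0' \
    '/dev/sda1 / ext3 rw,noatime 0 0' | ro_mounts
```

On a server, the same filter would be run as "ro_mounts < /proc/mounts".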
Hi Brian: thanks for the reply, as usual...

Well, there were no server problems, but I did run an e2fsck on the MDT and OST and managed to fix the corruption (or rather, e2fsck did it for me).

So far no issues. But I do notice some files which have ?????? by their stats. Fortunately, these aren't important files, but I am not sure how to remove them. Any thoughts?

TIA

On Mon, Mar 23, 2009 at 8:42 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> This usually happens because some error on the servers causes lustre to
> turn the media read-only in an attempt to prevent corruption. The
> underlying cause is usually hardware related. Check the lustre servers
> to see which device has been made read-only and then check your storage
> to see what situations might have caused it.
>
> b.
On Mon, 2009-03-23 at 22:17 -0400, Mag Gam wrote:
> Hi Brian:

Hi,

> Well, there were no server problems but I did run a e2fsck on the MDT
> and OST I managed to fix the corruptions (or have the e2fsck do it for
> me).

There must have been some problems if the fscks needed to fix something.

> So far no issues. But I do notice some files which have ?????? by their stats.

Sounds like some objects have gone missing. Did you check the lost+found on the targets that you ran fsck on, to see if anything got relocated there? If there are entries there, you will have to try to figure out where they belong (there was a discussion here in the last week or two about methods/a tool to do that), or if all else fails, you can run lfsck to bring the MDT and OSTs back into sync with each other.

b.
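The lost+found check suggested above could look roughly like the following. This is a sketch that assumes the target has been stopped and mounted directly as ldiskfs at some mount point. The function name lost_found_entries and the device and mount point names in the comments are hypothetical.

```shell
#!/bin/sh
# Sketch: list whatever e2fsck relocated into lost+found on one target.
# Assumes the target is mounted directly as ldiskfs, e.g. (hypothetical
# device and mount point):
#   mount -t ldiskfs /dev/sdX /mnt/ost0
lost_found_entries() {
    dir="$1/lost+found"
    if [ ! -d "$dir" ]; then
        echo "no lost+found directory under $1" >&2
        return 1
    fi
    # One entry name per line; empty output means nothing was relocated.
    ls -A "$dir"
}

# Usage (hypothetical mount point):
#   lost_found_entries /mnt/ost0
```

Empty output on every target would suggest that the missing objects were not relocated by fsck, which leaves lfsck as the remaining option mentioned above.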