Daniel Leaberry
2007-Apr-24 08:43 UTC
[Lustre-discuss] 0 byte files and ll_readdir() error
We're running 1.6b7 and have noticed the following two problems. I'm wondering if they're correlated.

1. We get files that are 0 bytes. They have nothing in them.

2. We get these errors across our 30 nodes:

LustreError: 7030:0:(dir.c:330:ll_readdir()) error reading dir 167108765/2378987153 page 13: rc -5
LustreError: 7029:0:(dir.c:330:ll_readdir()) error reading dir 171699532/2388399554 page 9: rc -5
LustreError: 7027:0:(dir.c:330:ll_readdir()) error reading dir 171403580/2387428410 page 2: rc -5
LustreError: 6990:0:(dir.c:330:ll_readdir()) error reading dir 171011300/2386583645 page 8: rc -5
LustreError: 7027:0:(dir.c:330:ll_readdir()) error reading dir 172286916/2390172901 page 13: rc -5
LustreError: 6990:0:(dir.c:330:ll_readdir()) error reading dir 172030180/2388919021 page 13: rc -5
LustreError: 7027:0:(dir.c:330:ll_readdir()) error reading dir 172321971/2390308492 page 3: rc -5
LustreError: 7027:0:(dir.c:330:ll_readdir()) error reading dir 163603484/1208913504 page 8: rc -5
LustreError: 6990:0:(dir.c:330:ll_readdir()) error reading dir 172748079/2390802528 page 13: rc -5
LustreError: 9133:0:(dir.c:330:ll_readdir()) error reading dir 172818070/2390892206 page 2: rc -5
LustreError: 9171:0:(dir.c:330:ll_readdir()) error reading dir 168359805/2380837293 page 8: rc -5
LustreError: 9187:0:(dir.c:330:ll_readdir()) error reading dir 163706128/1209056171 page 7: rc -5
LustreError: 9199:0:(dir.c:330:ll_readdir()) error reading dir 165116087/1211142674 page 0: rc -5
LustreError: 9217:0:(dir.c:330:ll_readdir()) error reading dir 162005170/1206582728 page 12: rc -5
LustreError: 9216:0:(dir.c:330:ll_readdir()) error reading dir 162686166/1207618778 page 12: rc -5
LustreError: 6990:0:(dir.c:330:ll_readdir()) error reading dir 163079284/1208141145 page 3: rc -5

Can someone explain the ll_readdir error?

Thanks
--
Daniel Leaberry
Systems Administrator
iArchives
Tel: 801-494-6528
Cell: 801-376-6411
Daniel Leaberry
2007-Apr-24 08:49 UTC
Hate to reply to my own message, but the OSSes also log the following messages:

Apr 24 08:22:50 lu-oss01 kernel: LustreError: 8882:0:(client.c:514:ptlrpc_import_delay_req()) Skipped 119 previous similar messages
Apr 24 08:33:29 lu-oss01 kernel: LustreError: 8882:0:(client.c:514:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@0000010037d3ca00 x10335184/t0 o101->MGS@MGC192.168.101.31@tcp_0:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Apr 24 08:33:29 lu-oss01 kernel: LustreError: 8882:0:(client.c:514:ptlrpc_import_delay_req()) Skipped 119 previous similar messages
Apr 24 08:44:10 lu-oss01 kernel: LustreError: 8882:0:(client.c:514:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@00000101f5a00800 x10340196/t0 o101->MGS@MGC192.168.101.31@tcp_0:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0

The MDS is 192.168.101.31.

Daniel Leaberry
Systems Administrator
iArchives
Tel: 801-494-6528
Cell: 801-376-6411
On Apr 24, 2007 08:42 -0600, Daniel Leaberry wrote:
> We're running 1.6b7 and have noticed the following two problems. I'm
> wondering if they're correlated.
>
> 1. We get files that are 0 bytes. They have nothing in them.

This may or may not be related to the recent bug 12181 problem. That bug will be fixed in 1.6.0+ and 1.4.10.1 and 1.4.11+.

It can also happen if the clients are evicted while they are writing to the file.

> 2. We get these errors across our 30 nodes
> LustreError: 7030:0:(dir.c:330:ll_readdir()) error reading dir
> 167108765/2378987153 page 13: rc -5
> [...]

These are reporting I/O errors while reading directories from the MDS. This isn't a problem I've seen before, so it's hard to say what the root cause is.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
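For readers decoding these lines: the trailing "rc -5" is a negated errno value, and errno 5 is EIO (input/output error) on Linux, which matches the description above. The following is a minimal sketch of pulling the fields out of such a line. The function name, the regular expression, and the reading of the first number as a process id are assumptions made for illustration; they are not taken from the Lustre sources.

```python
import errno
import os
import re

# Matches the client-side ll_readdir() lines quoted in this thread.
# Treating the first number as a process id is an assumption.
READDIR_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:\(dir\.c:\d+:ll_readdir\(\)\) "
    r"error reading dir (?P<dir>\d+/\d+) page (?P<page>\d+): rc (?P<rc>-?\d+)"
)


def decode_readdir_error(line):
    """Return the parsed fields of one ll_readdir() error line, or None."""
    m = READDIR_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "pid": int(m.group("pid")),
        "dir": m.group("dir"),        # directory identifier exactly as logged
        "page": int(m.group("page")),
        "rc": rc,
        # Kernel return codes are negated errno values, so -5 maps to errno 5.
        "errno": errno.errorcode.get(-rc, "unknown"),
        "text": os.strerror(-rc) if rc < 0 else "",
    }


if __name__ == "__main__":
    sample = ("LustreError: 7030:0:(dir.c:330:ll_readdir()) error reading dir "
              "167108765/2378987153 page 13: rc -5")
    print(decode_readdir_error(sample))
```

Feeding a client's kernel log through this and grouping by the "dir" field would show whether the errors cluster on a few directories or are spread evenly.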
Daniel Leaberry
2007-Apr-25 08:38 UTC
> On Apr 24, 2007 08:42 -0600, Daniel Leaberry wrote:
>> We're running 1.6b7 and have noticed the following two problems. I'm
>> wondering if they're correlated.
>>
>> 1. We get files that are 0 bytes. They have nothing in them.
>
> This may or may not be related to the recent bug 12181 problem.
> That bug will be fixed in 1.6.0+ and 1.4.10.1 and 1.4.11+.
>
> It can also happen if the clients are evicted while they are
> writing to the file.

I figured out why this happened, but I'm not sure my explanation is valid. We run Lustre as more of a general-purpose filesystem, but usually with larger files. We use autofs to mount and unmount filesystems. The timeout is set to 120 seconds (after that much inactivity the filesystem is unmounted).

On a particular machine that was being accessed infrequently and with small files, what I think happened is this: a batch of XML files would be written and the metadata would be created on the MDS (hence the zero-byte files), but because Lustre tries to optimize its RPCs for 1MB I/Os and the client caches writes, the data wouldn't be written to the OSTs. Then autofs would unmount the filesystem without flushing the write buffers (that doesn't make sense), and a few minutes later I would get a client-evicted message on the MDS. Since the client was evicted, all its caches were flushed and the data was lost.

I'm not sure why autofs unmounting the filesystem wouldn't flush the buffers, and I'm also not sure why unmounting doesn't seem to inform the MDS that the client is leaving. I know Lustre probably isn't expecting to be mounted and unmounted every 5 minutes, but is this expected behavior?

>> 2. We get these errors across our 30 nodes
>> LustreError: 7030:0:(dir.c:330:ll_readdir()) error reading dir
>> 167108765/2378987153 page 13: rc -5
>> [...]
>
> These are reporting IO errors while reading directories from the MDS.
> This isn't a problem I've seen before, it's hard to say what is the
> root cause.

Is it possible the clients are just messed up? Especially since I get no errors on the MDS? I suppose this might be due to our autofs mounting and unmounting so many times.
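If the theory above is right and small writes sit in the client cache until the mount goes away, one application-side mitigation would be to force each file to storage before closing it. The sketch below uses the standard POSIX fsync call via Python. The helper name is hypothetical, and whether fsync would have avoided the loss on a 1.6b7 client is an assumption that nothing in this thread confirms.

```python
import os


def write_durably(path, data):
    """Write bytes to path and request a flush to storage before returning.

    Returns the file size as seen by the filesystem afterwards.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())    # ask the kernel to write its cached pages out
    return os.path.getsize(path)


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    size = write_durably("/tmp/example.xml", b"<doc/>")
    print("wrote", size, "bytes")
```

The cost is one synchronous flush per file, which works against the batching of writes into large RPCs described above, so this trades throughput for a smaller window in which data exists only in client memory.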
On Apr 25, 2007 08:38 -0600, Daniel Leaberry wrote:
> I figured out why this happened but I'm not sure if my explanation is
> valid. We run lustre as more of a general purpose filesystem but usually
> with larger size files. We use autofs to mount and unmount filesystems.
> The timeout is set to 120 seconds (after that much inactivity the
> filesystem is unmounted)

Please try the patch in bug 12181. This sounds like it may be related.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
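To judge whether a patch or configuration change helps, it may be useful to count the zero-byte files before and after. The following is a rough sketch using only the Python standard library. The function name is hypothetical, and the mount point in the usage line is a placeholder rather than a path from this thread.

```python
import os
import stat


def find_empty_files(root):
    """Walk root and return (empty_files, walk_errors).

    empty_files: sorted paths of regular files whose size is 0 bytes.
    walk_errors: OSError instances for directories that could not be listed.
    """
    empty = []
    errors = []
    # By default os.walk ignores directories it cannot list; passing
    # onerror records them, which matters if readdir itself is failing.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=errors.append):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError as exc:
                errors.append(exc)
                continue
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                empty.append(path)
    return sorted(empty), errors


if __name__ == "__main__":
    # "/mnt/lustre" is a placeholder mount point.
    empty, errors = find_empty_files("/mnt/lustre")
    print(len(empty), "empty files;", len(errors), "errors while walking")
```

Some applications create empty files on purpose, so the count is only meaningful when compared against an earlier run or filtered to paths that are expected to hold data.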