Robinson Tiemuqinke
2006-Aug-13 07:01 UTC
extremely slow "ls" on a cleared fatty ext3 directory on FC4/5
Hi,

A stupid flat directory, /tmp, is holding 5 million files. The directory is located on an ext3 file system with the dir_index feature turned on. The machines run FC4 and FC5.

The files are directly under /tmp, not in any subdirectories -- they are the result of user mis-operations. An 'ls' or 'find' command takes one hour to finish, and a lot of other applications on the boxes are affected.

I managed to delete the files one by one with a command similar to 'find . | xargs rm -rf', which took about 10 hours. But after a file system sync, it still takes 20 minutes to list the cleaned /tmp directory, even though the directory now holds only 8 files in total.

So I ran 'ls' on the directory itself (not on any files or subdirectories in it) and found that its size is stupidly large (131M even after the deletion) compared with 4K for normal directories.

-bash-3.00# ls -alFdh /tmp*
drwxrwxrwt 4 root staff 4.0K Aug 12 23:17 new_tmp/
drwxrwxrwt 4 root staff 131M Aug 12 20:30 tmp/

Does anyone know why the formerly fat directory still looks unchanged and takes hours to traverse even after 99.999999% of the files were removed?

Is there any way to fix this kind of problem without rebooting the machine? I'm afraid of the commands "rsync -avHn /tmp/ /new_tmp/; rm -rf /tmp/ && mv /new_tmp/ /tmp" because other applications are accessing /tmp/ as well.

Please help. Thanks a lot.
Theodore Tso
2006-Aug-13 17:46 UTC
extremely slow "ls" on a cleared fatty ext3 directory on FC4/5
On Sun, Aug 13, 2006 at 12:01:17AM -0700, Robinson Tiemuqinke wrote:
> A stupid flat directory /tmp holding 5 millon files,
> the directory locates on a ext3 file system with
> dir_index feature turned on. The running Linux are FC4
> and FC5.
>
> The files are just directly under /tmp, not in any
> subdirectories -- they are results of mis-operations
> of users.

Wow! How many users do you have on your system? And over what period of time did this build up?

From a system administration point of view, a really good idea is to have a job which deletes all files in /tmp that stick around for longer than 24 hours or so, and unconditionally on reboot. Then when the users scream, you can give them access to a /scratch partition which has slightly more lax rules, such as deletion after 1 or 2 weeks, and with a README which says, "not backed up --- data can be deleted at any time, and if you complain, we will laugh at you". :-)

From a technical point of view, what's happening is that dir_index speeds up directory lookups by using a hash tree. Unfortunately, POSIX imposes requirements about how readdir() is supposed to work if files are added or deleted while the readdir() is in progress. (Basically, a file which is created or deleted during the readdir() must appear once or not at all, and all other files must be returned exactly once.) This isn't too bad, except that this requirement must also be maintained across a telldir(), which saves a linear offset into the directory, and a seekdir(), which seeks back to that location on disk.

This interface is horribly broken, as it fundamentally assumes a linear linked-list implementation such as was used three decades ago in Unix. And it gives filesystem implementors nightmares when they are required to provide this interface even when they are trying to use more advanced data structures that no longer have a linear directory layout --- say, like a B-tree.
Different filesystems solve this in different ways; some use multiple B-trees, with one B-tree existing only so that readdir() can have the proper semantics. This has the downside that file creations and deletions now have to update two separate trees. The choice which ext3 made was a simpler one: we simply return files in hash sort order. This provides the correct semantics, but unfortunately it means that workloads which do a readdir() followed by a stat() of each file end up accessing the inode table in an effectively random order. The same access pattern can also arise when the inode table is fragmented, but returning files in hash order makes the worst case happen every single time.

There are solutions, and the simplest is to have programs read the entire directory into memory, and then sort the list by inode number before actually stat'ing the files. This can be done in userspace much more easily than in the kernel, since userspace memory is swappable and kernel memory is not. I have written an LD_PRELOAD library which allows a program to do the right thing without needing to modify the program:

http://www.redhat.com/archives/ext3-users/2004-September/msg00025.html

Unfortunately, for programs that use telldir() and seekdir(), hold on to the telldir() pointer for a long time, and still expect POSIX semantics, this will not necessarily work correctly, so it's not something I would recommend as a systemwide LD_PRELOAD. But it is useful for accelerating programs that haven't yet been modified, such as ls and find. Other programs, such as mutt's maildir handling, have already been modified in this way, which is a much better solution. (In fact, it provides speedup benefits on all filesystems, just much more on ext3 filesystems with dir_index enabled.)

The fact that ext3 doesn't shrink directories is a long-standing Unix implementation restriction. It's not impossible for us to add support for truncating directories as files get deleted, but it's just never bubbled up to the top of the todo list; in practice, workloads that create gigantic directories that then shrink down to nothing are relatively rare.

> If there are any ways to fix this kind of problem
> without rebooting machine? I'm afraid of the commands
> "rsync -avHn /tmp/ /new_tmp/; rm -rf /tmp/ && mv
> /new_tmp/ /tmp" because other applications are
> accessing /tmp/ as well.

Not necessarily a reboot, but it will probably require scheduled downtime where you kick all of the users off and then recreate the tmp directory --- either using rsync, or just doing a plain old "rm -rf /tmp; mkdir /tmp".

If users are expecting that files stick around in /tmp, that's a huge cultural problem, and it will come back to haunt you in multiple ways....

- Ted
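
The userspace workaround described in the message above (read the whole directory into memory, sort by inode number, then stat) can be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions, not the LD_PRELOAD library linked in the message: the function name stat_in_inode_order is hypothetical, and it relies on os.scandir() reporting each entry's inode number without a separate stat() call, which is the case on Linux.

```python
import os


def stat_in_inode_order(path):
    """Return (name, stat_result) pairs for `path`, stat'ing in inode order.

    Hypothetical helper for illustration; not part of any existing tool.
    """
    # Step 1: read the entire directory into memory. os.scandir() wraps
    # readdir(); on Linux, DirEntry.inode() comes from the directory entry
    # itself, so no inode-table access happens here.
    with os.scandir(path) as it:
        entries = [(entry.inode(), entry.name) for entry in it]

    # Step 2: sort by inode number so the following stat calls walk the
    # inode table roughly sequentially instead of in hash order.
    entries.sort()

    # Step 3: stat each file in that order.
    results = []
    for _ino, name in entries:
        try:
            st = os.lstat(os.path.join(path, name))
        except FileNotFoundError:
            # The file was deleted between the readdir and the stat.
            continue
        results.append((name, st))
    return results
```

A caller that wants alphabetical output, as ls produces, can sort the returned list by name afterwards; the point is only that the stat calls themselves happen in inode order.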
Michael Hennebry
2006-Aug-14 18:43 UTC
extremely slow "ls" on a cleared fatty ext3 directory on FC4/5
On Sun, 13 Aug 2006, Robinson Tiemuqinke wrote that /tmp on an ext3 filesystem had held 5 million plain files, that he deleted most of them, and that even though the files are gone, just listing the remaining 8 files in /tmp takes 20 minutes.

> -bash-3.00# ls -alFdh /tmp*
> drwxrwxrwt 4 root staff 4.0K Aug 12 23:17 new_tmp/
> drwxrwxrwt 4 root staff 131M Aug 12 20:30 tmp/
>
> Anyone know why the former fatty directory still looks
> unchanged and takes hours to traverse even after
> 99.999999% files got removed?

Another poster stated that on ext3, directories can grow, but not shrink.

> If there are any ways to fix this kind of problem
> without rebooting machine? I'm afraid of the commands
> "rsync -avHn /tmp/ /new_tmp/; rm -rf /tmp/ && mv
> /new_tmp/ /tmp" because other applications are
> accessing /tmp/ as well.

If /tmp is an entire partition by itself, I think that the only way is to reformat the partition. You don't necessarily have to reboot, but you will need to kick off anyone using /tmp.

If /tmp is a soft link to /fred/tmp and /fred is an entire partition by itself, you might be able to do something like this:

    # cd /fred
    # mkdir new.tmp
    # cd new.tmp
    # ln ../tmp/* .    # hard links won't work on directories
    # cd ..
    # mv tmp old.tmp
    # # 'twould be best if no one tried to use /tmp at this point
    # mv new.tmp tmp
    # rm -r old.tmp

There might still be problems if someone had the original /tmp as the current directory or just open.

--
Mike   hennebry at web.cs.ndsu.NoDak.edu
"it stands to reason that they weren't always called the ancients."
                                             -- Daniel Jackson