I've run into an odd problem which I lovingly refer to as a "black
hole directory".
On a Thumper used for mail stores we've found that runs of find take an
exceptionally long time. There are directories with as many as
400,000 files, which I immediately assumed were the culprit, but on
investigation they aren't the problem at all. The problem is seen
here in this truss output (the first column is delta time):
0.0001 lstat64("tmp", 0x08046A20) = 0
0.0000 openat(AT_FDCWD, "tmp", O_RDONLY|O_NDELAY|O_LARGEFILE) = 8
0.0001 fcntl(8, F_SETFD, 0x00000001) = 0
0.0000 fstat64(8, 0x08046920) = 0
0.0000 fstat64(8, 0x08046AB0) = 0
0.0000 fchdir(8) = 0
1321.3133 getdents64(8, 0xFEE48000, 8192) = 48
1255.8416 getdents64(8, 0xFEE48000, 8192) = 0
0.0001 fchdir(7) = 0
0.0001 close(8) = 0
These two getdents64 syscalls take roughly 20 minutes each. Notice that the
first call returns only 48 bytes of entries, and the directory is empty:
drwx------ 2 102 102 2 Feb 21 02:24 tmp
My assumption is that the directory is corrupt, but I'd like to prove
that. I have a scrub running on the pool, but it's got about 16 hours to go
before it completes; it's 20% done so far and nothing has been reported.
No errors are logged when I stimulate the problem.
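One avenue I haven't tried yet is dumping the directory's object with zdb to
see what its ZAP actually contains. Something like this is what I have in
mind (only a sketch; "pool/mailstore" is a made-up dataset name, and OBJNUM
is just the directory's inode number):

  ls -di tmp                        # the inode number doubles as the ZFS object number
  zdb -dddd pool/mailstore OBJNUM   # dump that object, including its ZAP contents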
Does anyone have suggestions on how to get additional data on this issue?
I've used dtrace flow tracing to examine it, but what I really want to see is
the zio's issued as a result of the getdents calls, and I can't see how to do
that. Ideally I'd quiesce the system and watch all zio's
occurring while I stimulate the problem, but this is production and that isn't
possible. If anyone knows how to watch DMU/ZIO activity that pertains _only_ to
a certain PID, please let me know. ;)
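For what it's worth, this is roughly the shape of what I'm after, though I
haven't been able to run it here. It's only a sketch: it assumes fbt probes
exist for zfs_readdir, zap_cursor_retrieve and zio_wait in the zfs module,
that the directory reads happen in the calling thread, and PID is a
placeholder for the process being watched:

  dtrace -p PID -n '
    /* arm only on getdents64 from the target process */
    syscall::getdents64:entry /pid == $target/ { self->ts = timestamp; }

    /* count zfs-module activity while a getdents64 is in flight */
    fbt:zfs:zfs_readdir:entry,
    fbt:zfs:zap_cursor_retrieve:entry,
    fbt:zfs:zio_wait:entry
    /self->ts/ { @work[probefunc] = count(); }

    /* latency distribution of the getdents64 calls themselves */
    syscall::getdents64:return /self->ts/ {
      @lat["getdents64 latency (ns)"] = quantize(timestamp - self->ts);
      self->ts = 0;
    }'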
Suggestions on how to pro-actively catch these sorts of instances are welcome,
as are alternative explanations.
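One idea I had for the proactive side is a system-wide watcher that logs any
getdents64 call that stalls, along these lines (again just a sketch, and the
one-second threshold is arbitrary):

  dtrace -n '
    syscall::getdents64:entry { self->ts = timestamp; self->fd = arg0; }

    /* report any getdents64 that takes longer than one second */
    syscall::getdents64:return /self->ts && timestamp - self->ts > 1000000000/ {
      printf("%s (pid %d) fd %d: %d ms\n",
          execname, pid, self->fd, (timestamp - self->ts) / 1000000);
    }

    syscall::getdents64:return /self->ts/ { self->ts = 0; self->fd = 0; }'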
benr.