I have a non-redundant zpool configured on one slice of my disk, and in the past week have had two directories simply disappear, in two different filesystems.

The first was my email directory under my homedir (which is a ZFS fs) - I put this disappearance down to Thunderbird despite it never happening before. Then today, I was running a build of our product and an entire directory hierarchy in another FS disappeared.

In both cases, I was able to recover by copying over the missing directories from an earlier snapshot.

I ran a scrub on the zpool this morning (before the second directory went missing) and no errors were reported. I'm running S10 U1 patched up to date.

Does anyone have any ideas or suggestions as to how I might try to figure out what's wrong?

Thanks,
Trev
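Since both recoveries relied on copying from an earlier snapshot, one way to see exactly what vanished is to compare the live directory with its copy in the snapshot. The following is a minimal Python sketch, not something posted in the thread. The function name `missing_from_live` is hypothetical, and it assumes only that the snapshot is reachable as an ordinary directory (ZFS exposes snapshots under `.zfs/snapshot/<name>` at the root of each file system).

```python
import os

def missing_from_live(live_dir, snapshot_dir):
    """Return relative paths present under snapshot_dir but absent under live_dir."""
    missing = []
    for root, dirs, files in os.walk(snapshot_dir):
        rel_root = os.path.relpath(root, snapshot_dir)
        for name in sorted(dirs) + sorted(files):
            rel = os.path.normpath(os.path.join(rel_root, name))
            if not os.path.lexists(os.path.join(live_dir, rel)):
                missing.append(rel)
        # Prune the walk: a directory already reported as missing is not
        # descended into, so only the topmost missing entry is listed.
        dirs[:] = [d for d in dirs
                   if os.path.lexists(os.path.join(live_dir, rel_root, d))]
    return sorted(missing)
```

A call might look like `missing_from_live("/export/home/trev", "/export/home/trev/.zfs/snapshot/monday")`, where both the home directory path and the snapshot name `monday` are made up for illustration. Running it before restoring anything would give a record of which entries disappeared.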
On Friday 15 December 2006 15:28, Trevor Watson wrote:
> Does anyone have any ideas or suggestions as to how I might try to figure
> out what's wrong?

I have no idea, but the same thing happened to me yesterday (see http://www.opensolaris.org/jive/thread.jspa?threadID=20294&tstart=0 ). When I rebooted, my pool no longer mounted! It might or might not be related to your problem, but I'd be careful if you haven't rebooted since that happened.
Were there any errors reported in /var/adm/messages, or do you see any logged via fmdump?

In Solaris 10, 'ls' will not print any error message if reading from a directory fails. (Fixed in Nevada.) If something damaged a directory (including ZFS detecting a checksum error), its contents (or some of them) may appear to vanish without any error being printed.

You can see the error returned via 'truss' of the ls command, but since you've already recovered from snapshots, your file system may not be in the failed state any more. (However, as noted, backing up your data now -- onto a *fresh* backup, not replacing any existing backup -- is a very good idea.)
Anton B. Rang wrote:
> Were there any errors reported in /var/adm/messages, or do you see any logged via fmdump?
>
> In Solaris 10, 'ls' will not print any error message if reading from a directory fails. (Fixed in Nevada.) If something damaged a directory (including ZFS detecting a checksum error), its contents (or some of them) may appear to vanish without any error being printed.

'zpool status -v' will tell you the damaged files (if there are any).

eric
Anton B. Rang wrote:
> Were there any errors reported in /var/adm/messages, or do you see any logged via fmdump?

Nothing, unfortunately.

> In Solaris 10, 'ls' will not print any error message if reading from a directory fails. (Fixed in Nevada.) If something damaged a directory (including ZFS detecting a checksum error), its contents (or some of them) may appear to vanish without any error being printed.

I'll try trussing ls if it happens again.

The implication in what you've written is that ZFS doesn't report an error if it detects an invalid checksum. Is that correct?

Thx,
Trev
> The implication in what you've written is that ZFS doesn't report an error if
> it detects an invalid checksum. Is that correct?

No, sorry I wasn't more clear. ZFS detects and reports the invalid checksum. If the checksum error occurs on a directory, this can result in an error being returned from a readdir/getdents call. The 'ls' command in Solaris 10, however, will not print any error messages if it gets an error reading from a directory. It simply stops, as if it had reached the end of the directory.

I don't think this is what happened to you, though, if you didn't see any reported errors.
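The difference between stopping silently and reporting the failure comes down to whether the caller looks at the error from the directory read. As a rough illustration only (this is not the Solaris 'ls' source, and the function name `list_dir_checked` is hypothetical), a Python sketch that keeps whatever entries were read and also surfaces the error could look like this:

```python
import errno
import os

def list_dir_checked(path):
    """List a directory, returning (names, error) instead of hiding failures.

    error is None on success; otherwise it is a short string naming the errno.
    Entries read before the failure are still returned, so a partial listing
    is never mistaken for a complete one.
    """
    names = []
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                names.append(entry.name)
    except OSError as exc:
        code = errno.errorcode.get(exc.errno, str(exc.errno))
        return sorted(names), f"{code}: {exc.strerror}"
    return sorted(names), None
```

A caller that ignores the second element of the result would behave like the Solaris 10 'ls' described above: a directory that failed partway through would simply look shorter than it really is.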
Trevor Watson wrote:
> I'll try trussing ls if it happens again.
> The implication in what you've written is that ZFS doesn't report an
> error if it detects an invalid checksum. Is that correct?

If there is a checksum error, then 'ls' will fail. Here I forcibly put a checksum error on a directory:

# ls /monkey/dir
ls: error reading directory /monkey/dir: I/O error

Via truss, you can see getdents64() will return EIO.

# truss ls /monkey/dir
execve("/usr/bin/ls", 0x08047C4C, 0x08047C58) argc = 2
resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12
resolvepath("/usr/bin/ls", "/usr/bin/ls", 1023) = 11
sysconfig(_CONFIG_PAGESIZE) = 4096
xstat(2, "/usr/bin/ls", 0x08047A08) = 0
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT
xstat(2, "/lib/libsec.so.1", 0x080471D8) = 0
resolvepath("/lib/libsec.so.1", "/lib/libsec.so.1", 1023) = 16
open("/lib/libsec.so.1", O_RDONLY) = 3
mmap(0x00010000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xBFFB0000
mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xBFFA0000
mmap(0x00010000, 147456, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xBFF70000
mmap(0xBFF70000, 54015, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xBFF70000
mmap(0xBFF8E000, 12045, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 57344) = 0xBFF8E000
mmap(0xBFF91000, 8408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANON, -1, 0) = 0xBFF91000
munmap(0xBFF7E000, 65536) = 0
memcntl(0xBFF70000, 11452, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3) = 0
xstat(2, "/lib/libc.so.1", 0x080471D8) = 0
resolvepath("/lib/libc.so.1", "/lib/libc.so.1", 1023) = 14
open("/lib/libc.so.1", O_RDONLY) = 3
mmap(0xBFFB0000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xBFFB0000
mmap(0x00010000, 1040384, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xBFE70000
mmap(0xBFE70000, 937863, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xBFE70000
mmap(0xBFF65000, 27222, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 937984) = 0xBFF65000
mmap(0xBFF6C000, 5560, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANON, -1, 0) = 0xBFF6C000
munmap(0xBFF55000, 65536) = 0
memcntl(0xBFE70000, 187828, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3) = 0
xstat(2, "/lib/libavl.so.1", 0x080471D8) = 0
resolvepath("/lib/libavl.so.1", "/lib/libavl.so.1", 1023) = 16
open("/lib/libavl.so.1", O_RDONLY) = 3
mmap(0xBFFB0000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xBFFB0000
mmap(0x00010000, 73728, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xBFE50000
mmap(0xBFE50000, 3228, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xBFE50000
mmap(0xBFE61000, 220, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 4096) = 0xBFE61000
munmap(0xBFE51000, 65536) = 0
memcntl(0xBFE50000, 1256, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3) = 0
munmap(0xBFFB0000, 4096) = 0
mmap(0x00010000, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xBFFB0000
getcontext(0x080477B0)
getrlimit(RLIMIT_STACK, 0x080477A8) = 0
getpid() = 100888 [100887]
lwp_private(0, 1, 0xBFFB2000) = 0x000001C3
setustack(0xBFFB2060)
sysi86(SI86FPSTART, 0xBFF6CBE8, 0x0000133F, 0x00001F80) = 0x00000001
brk(0x08066208) = 0
brk(0x08068208) = 0
time() = 1166205046
ioctl(1, TCGETA, 0x0804797C) = 0
ioctl(1, TIOCGWINSZ, 0x08065478) = 0
brk(0x08068208) = 0
brk(0x08072208) = 0
lstat64("/monkey/dir", 0x08046860) = 0
openat(AT_FDCWD, "/monkey/dir", O_RDONLY|O_NDELAY|O_LARGEFILE) = 3
fcntl(3, F_SETFD, 0x00000001) = 0
fstat64(3, 0x08047860) = 0
getdents64(3, 0xBFFB4000, 8192) Err#5 EIO
fstat64(2, 0x080469A0) = 0
ls: error reading directory write(2, " l s : e r r o r r e".., 28) = 28
/monkey/dirwrite(2, " / m o n k e y / d i r", 11) = 11
: write(2, " : ", 2) = 2
I/O errorwrite(2, " I / O e r r o r", 9) = 9
write(2, "\n", 1) = 1
close(3) = 0
mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xBFE40000
munmap(0xBFE40000, 4096) = 0
_exit(0)
#
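Most of the trace above is dynamic-linker noise; the lines of interest are the ones ending in `Err#<number> <NAME>`. If a trace has been saved to a file, a small filter can pull those out. This is only a sketch in Python: the function name `failed_syscalls` and the regular expression are my own, and they assume the line shape seen in the trace above (call, arguments in parentheses, then `Err#N NAME`).

```python
import re

# Matches lines such as: getdents64(3, 0xBFFB4000, 8192) Err#5 EIO
FAILED_CALL = re.compile(
    r"^(?P<call>\w+)\((?P<args>.*)\)\s+Err#(?P<num>\d+)\s+(?P<name>\w+)\s*$"
)

def failed_syscalls(truss_output):
    """Return (syscall, errno number, errno name) for each failed call in a trace."""
    failures = []
    for line in truss_output.splitlines():
        match = FAILED_CALL.match(line.strip())
        if match:
            failures.append(
                (match.group("call"), int(match.group("num")), match.group("name"))
            )
    return failures
```

Applied to the trace above, this would report the harmless ENOENT on /var/ld/ld.config and the EIO from getdents64(), which is the one that matters here.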
Just to make sure there's no confusion ;-), this error message was added to 'ls' after Solaris 10, and hasn't been backported yet. (Bug 4985395, *ls* does not report errors from getdents().)
Was it over NFS?
Was zil_disable set on the server?

If it's yes/yes, I still don't know for sure if that would be grounds for a causal relationship, but I would certainly be looking into it.

-r
Roch - PAE wrote:
> Was it over NFS ?

No, local.

> Was zil_disable set on the server ?

Not unless it is set by default. I haven't changed any ZFS params.

> If it's yes/yes, I still don't know for sure if that would
> be grounds for a causal relationship, but I would certainly
> be looking into it.

Thanks anyway.

T.