Dear all
I''m about to answer my own question with some really useful hints
from Steve, thanks for that!!!
On 03/02/2012 07:43 AM, Thomas Nau wrote:> Dear all
> I asked before but without much feedback. As the issue
> is persistent I want to give it another try. We disabled
> panicing for such kind of error in /etc/system but still
> see messages such as
>
> zfs: accessing past end of object 5b1aa/21a8008 (size=60416
access=32603+32768)
>
> in the logs. Is there any way to identify which object (file?)
> causes this?
I''m about to answer my own question with some really useful hints
from Steve, thanks for that!!!
In case your system crashed there''s an option in /etc/system to
return an IO error instead of panicing, just add
set zfs:zfs_recover=1
After rebooting the system will issue a warning such as
Mar 4 15:04:16 neo genunix: [ID 762447 kern.warning] WARNING:
zfs: accessing past end of object bab/21a8008 (size=60416 access=32603+32768)
to syslog whenever the problem shows. The important numbers are
dataset ID: bab
object ID: 21a8008
Those would also show up in the kernel panic message.
Assuming the ZFS datasets still exist and the file is
also unaltered we don''t need crash dump analysis but can
use zdb instead. I running the latest S11 bits by the way
First use some bash magic to turn the hex numbers into decimals
as zdb deals with those
# printf "%d %d\n" 0xbab 0x21a8008
987 35291144
now lookup the dataset; I assume we already have a pretty
good idea about which pool to check
# zdb -d -r pool1 | grep "ID 2987"
Dataset pool1/.../backup-clone [ZPL], ID 2987, cr_txg 190496, 1.36T, 15590871
objects
Now lookup the actual object. Add more "-v" to get even more data
# zdb -vvv pool1/backup/nfs/home/student1/backup-clone 35291144
Dataset pool1/.../backup-clone [ZPL], ID 2987, cr_txg 190496, 1.36T, 15590871
objects,
rootbp DVA[0]=<2:2a000cada00:c00:RZM:4> [L0 DMU objset] fletcher4 lzjb LE
contiguous unique
unencrypted 4-copy size=800L/200P birth=190500L/190500P fill=15590871
cksum=1abc142ec9:88286e5b5e3:18d73114fd4d4:34d14f5e348c05
Object lvl iblk dblk dsize lsize %full type
35291144 1 16K 59.0K 32K 59.0K 100.00 ZFS plain file
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
path /zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite
uid 63883
gid 400
atime Tue Jan 10 14:15:34 2012
mtime Tue Jan 10 14:23:01 2012
ctime Tue Jan 10 14:23:01 2012
crtime Wed Oct 19 09:43:54 2011
gen 15760229
mode 100644
size 2228224
parent 34303712
links 1
pflags 40800000004
So here comes the funny stuff: according to the object data the size
is 2228224 bytes which matches of course the "ls -l" output. On the
other hand ZFS read complained after about 32k which fits the dsize/lsize
columns as we use compression.
Strange, isn''t it but wait, it gets even more confusing... The initial
panic, now turned into warnings, was caused by the TSM backup client trying
to backup the file. We use zfs clones to get a consistent backup as much
as possible. Let''s truss the client (cut some path elements):
access("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite",
R_OK) = 0
open64("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite",
O_RDONLY|O_NONBLOCK) = 6
acl("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite",
ACE_GETACL, 1024, 0x0846D780) = 3
read(6, " S Q L i t e f o r m a".., 32603) = 32603
...
read(6, 0x086B0C98, 32768) Err#5 EIO
now let''s just cat the file and see what happens:
# cat /backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite
> TEST
# ls -l TEST
-rw-r--r-- 1 root root 2228224 Mar 4 15:50 TEST
no complains. Observing the appropriate routine through
# dtrace -n ''::zfs_panic_recover:entry { stack(); }''
does not trigger. Checking the backup client again... no more errors
as truss also confirms
open64("/backup/pool1/.../zep13/.mozilla/firefox/jlonp9fm.default/cookies.sqlite",
O_RDONLY|O_NONBLOCK) = 6
read(6, " S Q L i t e f o r m a".., 32603) = 32603
read(6, " 3 4 1 0 8 . 2 . 2 . u t".., 32768) = 32768
read(6, "\0\0\0\0\0\0\0\0\0\0\0\0".., 32768) = 32768
...
Double checking with zdb and "ls -i" shows the same object ID.
I''m really puzzled!!! Any more ideas what''s going on?
Thomas