David Dyer-Bennet
2009-Jan-13 03:17 UTC
[zfs-discuss] zfs null pointer deref, getting data out of single-user mode
My home NAS box, which I'd upgraded to OpenSolaris 2008.11 after a series of crashes left the smf database damaged, and which then ran cleanly for 4 days, has suddenly fallen right back to where the old one was before.

Looking at the logs, I see something similar to this (manually transcribed to paper and retyped):

    Bad trap: type=e (page fault) rp=f..f00050e3250 addr=28
    module ZFS null pointer dereference

(That's "rp=" followed by some number of 'f's and then those exact hex digits.) Lots of other data was logged, and it looks as if a kernel dump was written. I have multiple instances of this crash in my logs now.

I don't know how the kernel dump space works; I don't know if I only have the latest dump, or what. Who needs this information? And how can I get it off the system? I've played with various attempts to mount a thumb drive, and googled around, and I can't find any clues on how to do it. This is in the mode text-mode boot gets into when the smf database is corrupt -- maintenance mode, or some such; I have to give the username and password I established for the admin user at installation. Not that the thumb drive will help for the kernel dump anyway, if I read the logs correctly.

So now I've been down for more than a week, *and* I think I destroyed all my file permissions last night trying to get the final steps done and the system back into service. I really don't know what I'm going to do; I got to this point because I decided to reinstall over the old nv76 I was running rather than try to recover it (I did recover the data zpool), and it seems not to have advanced me any. On the one hand, that sounds like hardware; but the log entries for the crash are about ZFS null pointer derefs, which does NOT sound like hardware.

I'll be reinstalling tomorrow night, unless somebody says they need the data, in which case I can work on getting it out -- if anybody can give me some clues on *how* to get the data out.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
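On the thumb-drive question, in case it helps: removable-media automounting isn't running in maintenance mode, so the stick has to be mounted by hand. A rough sketch, assuming a FAT-formatted stick that rmformat reports as c2t0d0 (the actual device name will differ on your box):

    rmformat -l                                   # list removable devices and their device paths
    mkdir -p /mnt/usb
    mount -F pcfs /dev/dsk/c2t0d0p0:c /mnt/usb    # :c = first FAT partition on the stick
    cp /var/adm/messages* /mnt/usb/               # logs; the kernel dump itself needs savecore (below)
    umount /mnt/usb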
If you've got enough space on /var, and you had a dump partition configured, you should find a bunch of "vmcore.[n]" files in /var/crash by now. The system normally dumps the kernel core into the dump partition (which can be the swap partition) and then copies it into /var/crash on the next successful reboot.

There's likely also a stack printed at the time of the crash; that might be enough for the ZFS developers to determine if this is a known (or even fixed) bug. It's also retrievable from the core. If it's not a known bug, or if more data is needed, the developers might want a copy of the core....
-- 
This message posted from opensolaris.org
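For reference, a rough sketch of pulling that stack back out of a saved core with mdb, assuming savecore has already written a unix.0/vmcore.0 pair into the savecore directory (the 0 is the dump number and will vary):

    cd /var/crash/`hostname`                   # or wherever dumpadm says the savecore directory is
    echo '::status' | mdb -k unix.0 vmcore.0   # panic string, OS release, dump details
    echo '::msgbuf' | mdb -k unix.0 vmcore.0   # console messages leading up to the panic
    echo '$c'       | mdb -k unix.0 vmcore.0   # stack trace of the panicking thread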
Sigh. Richard points out in private email that automatic savecore functionality is disabled in OpenSolaris; you need to manually set up a dump device and save core files if you want them. However, the stack may be sufficient to ID the bug.
-- 
This message posted from opensolaris.org
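A quick way to see what the installer actually set up is dumpadm with no arguments; it reports the current dump device, the savecore directory, and whether savecore is enabled:

    dumpadm          # no arguments: just print the current crash dump configuration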
Anton B. Rang wrote:
> Sigh. Richard points out in private email that automatic savecore
> functionality is disabled in OpenSolaris; you need to manually set up
> a dump device and save core files if you want them. However, the
> stack may be sufficient to ID the bug.

The dump device is there; you just need to copy the data from it to a file system, using savecore.
-- richard
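For that one-time copy, something like the following should work, assuming a destination directory that already exists and has enough free space for the dump (it can run to several GB):

    savecore -v /var/crash/`hostname`     # reads the dump device, writes unix.N / vmcore.N there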
> Sigh. Richard points out in private email that automatic savecore
> functionality is disabled in OpenSolaris; you need to manually
> set up a dump device and save core files if you want them.
> However, the stack may be sufficient to ID the bug.

The dump device is present, so there is no need to set one up. If you enable savecore using dumpadm(1m), you must create the configured savecore directory manually, though.
-- julien.
http://blog.thilelli.net/
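Roughly, to make that permanent for future panics (the directory name here is just the conventional default; adjust to suit):

    mkdir -p /var/crash/`hostname`          # dumpadm/savecore will not create this for you
    dumpadm -s /var/crash/`hostname` -y     # -s sets the savecore directory, -y runs savecore on reboot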