Török Edwin
2011-Feb-26 18:56 UTC
[Libguestfs] hivex: some issues (key encoding, ...) and suggested fixes
Hi,

libhivex seems to do a great job at parsing hives most of the time, but
there are some issues with a few registry keys.

These can be worked around in the application that uses libhivex, but I
think it'd be better if libhivex handled these itself.

1. UTF16 string in REG_SZ that has garbage after the \0\0

There is code in hivex.c to handle this already but I think it has a typo:

    /* Deal with the case where Windows has allocated a large buffer
     * full of random junk, and only the first few bytes of the buffer
     * contain a genuine UTF-16 string.
     *
     * In this case, iconv would try to process the junk bytes as UTF-16
     * and inevitably find an illegal sequence (EILSEQ).  Instead, stop
     * after we find the first \0\0.
     *
     * (Found by Hilko Bengen in a fresh Windows XP SOFTWARE hive).
     */
    size_t slen = utf16_string_len_in_bytes_max (data, len);
    if (slen > len)
      len = slen;

    char *ret = windows_utf16_to_utf8 (data, len);

slen is only used to increase length of data, but I think it should be
decreasing it (to stop earlier).

Example key where problem occurs:

    software\Microsoft\MediaPlayer\Preferences> lsval
    hivexsh: lsval: Invalid or incomplete multibyte or wide character
    "MyPlayLists"=software\Microsoft\MediaPlayer\Preferences>

Same for LcnStartLocation key in
HKLM\\SOFTWARE\\Microsoft\\Dfrg\\BootOptimizeFunction (it starts with
30 00 00 00 .. some garbage).

Printing the key with value_value shows this, which would be fine if hivex stopped parsing after the first 00 00: 43 00 3A 00 5C 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 73 00 20 00 61 00 6E 00 64 00 20 00 53 00 65 00 74 00 74 00 69 00 6E 00 67 00 73 00 5C 00 41 00 6C 00 6C 00 20 00 55 00 73 00 65 00 72 00 73 00 5C 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 69 00 5C 00 4D 00 75 00 73 00 69 00 63 00 61 00 5C 00 53 00 61 00 6D 00 70 00 6C 00 65 00 20 00 50 00 6C 00 61 00 79 00 6C 00 69 00 73 00 74 00 73 00 00 00 64 F7 06 00 2E 40 92 7C A8 20 08 00 3C F5 06 00 70 09 92 7C C0 E4 98 7C EF 40 92 7C BB 40 92 7C 04 01 00 00 00 DC FD 7F 00 00 00 00 02 00 00 00 39 00 00 00 C8 05 92 7C 90 97 08 00 00 00 00 00 08 00 0A 00 88 3E 92 7C 1A 02 00 00 00 00 00 00 98 97 08 00 F8 81 5D 77 B8 1B 09 00 6A 00 00 00 00 00 00 00 E0 1B 09 00 5C 01 08 00 6A 00 6C 00 00 DC FD 7F 3C F5 06 00 02 00 00 00 A0 20 08 00 60 00 00 01 43 00 3A 00 5C 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 73 00 20 00 61 00 6E 00 64 00 20 00 53 00 65 00 74 00 74 00 69 00 6E 00 67 00 73 00 5C 00 41 00 6C 00 6C 00 20 00 55 00 73 00 65 00 72 00 73 00 5C 00 44 00 61 00 74 00 69 00 20 00 61 00 70 00 70 00 6C 00 69 00 63 00 61 00 7A 00 69 00 6F 00 6E 00 69 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Workaround: I use value_value if value_string fails 2. Non-ascii node names I found a node with a \xDC (?) in it: SOFTWARE\\ODBC\\ODBCINST.INI\\MS Code Page-\xDCbersetzer hivex.c has a comment like this: /* AFAIK the node name is always plain ASCII, so no conversion * to UTF-8 is necessary. 
However we do need to nul-terminate * the string. */

I think hivex should convert the node names from CP1252 (or is it
ISO-8859-1?) to UTF-8.

Workaround: I do the CP1252 -> UTF8 conversion myself for now

3. node_get_child is slow

Documentation issue, it should say that using node_get_child is slow
(because registry doesn't have an index, and you do a linear search).

Workaround: I create a map of node names to children of a node, a lookup
in that is faster than using node_get_child repeatedly

4. hivexml output is not a well-formed XML

See problem #1 and #2, if value_string and node_name are fixed to not
dump the binary garbage and just return UTF8 then I think hivexml's
output would pass xmllint.

Best regards,
--Edwin
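The caller-side workaround for issue 2 (converting node names to UTF-8) might look roughly like the sketch below. It is not hivex code: the function name latin1_to_utf8 is hypothetical, and the sketch assumes the name bytes are ISO-8859-1. CP1252 agrees with ISO-8859-1 for \xDC but differs in the 0x80-0x9F range, so a real CP1252 conversion would need a lookup table or iconv for those bytes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of hivex: convert a NUL-terminated
 * ISO-8859-1 string to a newly allocated UTF-8 string.
 * The caller frees the result.  Returns NULL if malloc fails.
 */
char *
latin1_to_utf8 (const char *in)
{
  size_t n = strlen (in);
  /* Each input byte expands to at most two output bytes. */
  char *out = malloc (2 * n + 1);
  char *p = out;
  size_t i;

  if (out == NULL)
    return NULL;

  for (i = 0; i < n; i++) {
    unsigned char c = (unsigned char) in[i];

    if (c < 0x80) {
      /* Plain ASCII is unchanged in UTF-8. */
      *p++ = (char) c;
    } else {
      /* U+0080..U+00FF encode as 110xxxxx 10xxxxxx. */
      *p++ = (char) (0xC0 | (c >> 6));
      *p++ = (char) (0x80 | (c & 0x3F));
    }
  }
  *p = '\0';
  return out;
}
```

With this, the node name "MS Code Page-\xDCbersetzer" comes out with \xDC replaced by the two bytes C3 9C, the UTF-8 encoding of U+00DC.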
Matthew Booth
2011-Feb-28 14:33 UTC
[Libguestfs] hivex: some issues (key encoding, ...) and suggested fixes
On 26/02/11 18:56, Török Edwin wrote:
> Hi,
>
> libhivex seems to do a great job at parsing hives most of the time, but
> there are some issues with a few registry keys.
>
> These can be worked around in the application that uses libhivex, but I
> think it'd be better if libhivex handled these itself.
>
> 1. UTF16 string in REG_SZ that has garbage after the \0\0
>
> There is code in hivex.c to handle this already but I think it has a typo:
>
>     /* Deal with the case where Windows has allocated a large buffer
>      * full of random junk, and only the first few bytes of the buffer
>      * contain a genuine UTF-16 string.
>      *
>      * In this case, iconv would try to process the junk bytes as UTF-16
>      * and inevitably find an illegal sequence (EILSEQ).  Instead, stop
>      * after we find the first \0\0.
>      *
>      * (Found by Hilko Bengen in a fresh Windows XP SOFTWARE hive).
>      */
>     size_t slen = utf16_string_len_in_bytes_max (data, len);
>     if (slen > len)
>       len = slen;
>
>     char *ret = windows_utf16_to_utf8 (data, len);
>
> slen is only used to increase length of data, but I think it should be
> decreasing it (to stop earlier).

Yup, that certainly looks like a bug.

> 2. Non-ascii node names
>
> I found a node with a \xDC (?) in it:
> SOFTWARE\\ODBC\\ODBCINST.INI\\MS Code Page-\xDCbersetzer
>
> hivex.c has a comment like this:
>     /* AFAIK the node name is always plain ASCII, so no conversion
>      * to UTF-8 is necessary.  However we do need to nul-terminate
>      * the string.
>      */
>
> I think hivex should convert the node names from CP1252 (or is it
> ISO-8859-1?) to UTF-8.
>
> Workaround: I do the CP1252 -> UTF8 conversion myself for now
>
> 3. node_get_child is slow
>
> Documentation issue, it should say that using node_get_child is slow
> (because registry doesn't have an index, and you do a linear search).
>
> Workaround: I create a map of node names to children of a node, a lookup
> in that is faster than using node_get_child repeatedly
>
> 4. hivexml output is not a well-formed XML
>
> See problem #1 and #2, if value_string and node_name are fixed to not
> dump the binary garbage and just return UTF8 then I think hivexml's
> output would pass xmllint.

As it happens, I opened a BZ on this just the other day. I think there's
an additional element here: it seems that sometimes a registry key
genuinely contains non-text data. An example is
HKLM/SOFTWARE/Microsoft/MSDTC/Security/XAKey, which I'm guessing is a
cryptographic key. This would require a CDATA section. However, it's not
clear to me how the tool can reliably infer that a value is binary data
without specific knowledge of the schema.

Matt
-- 
Matthew Booth, RHCA, RHCSS
Red Hat Engineering, Virtualisation Team

GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A  1600 3441 EA19 D33C 3490
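For issue 1, the quoted comparison only ever grows len, so the junk after the terminator is still passed to the converter. A minimal sketch of the clamping the comment seems to describe follows. The helper utf16_len_in_bytes_max is a stand-in written for this example (the real utf16_string_len_in_bytes_max is internal to hivex.c and may differ), and clamp_to_terminator is a hypothetical name.

```c
#include <stddef.h>

/* Stand-in for the internal hivex helper (an assumption, written for
 * this sketch): return the length in bytes of the UTF-16LE string at
 * the start of data, not counting the terminating \0\0, and never
 * looking past len bytes.
 */
size_t
utf16_len_in_bytes_max (const char *data, size_t len)
{
  size_t ret = 0;

  /* Step over whole 16-bit code units until one is all zero. */
  while (len >= 2 && (data[0] || data[1])) {
    data += 2;
    len -= 2;
    ret += 2;
  }
  return ret;
}

/* Hypothetical wrapper showing the corrected comparison: shrink len
 * to the terminator if one was found, and never grow it.
 */
size_t
clamp_to_terminator (const char *data, size_t len)
{
  size_t slen = utf16_len_in_bytes_max (data, len);

  if (slen < len)
    len = slen;
  return len;
}
```

The only change relative to the quoted code is the direction of the comparison; the clamped length is what would then be handed to the UTF-16 to UTF-8 conversion.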
Richard W.M. Jones
2011-Mar-01 04:33 UTC
[Libguestfs] hivex: some issues (key encoding, ...) and suggested fixes
On Mon, Feb 28, 2011 at 02:33:32PM +0000, Matthew Booth wrote:
> On 26/02/11 18:56, Török Edwin wrote:
> >4. hivexml output is not a well-formed XML
> >
> >See problem #1 and #2, if value_string and node_name are fixed to not
> >dump the binary garbage and just return UTF8 then I think hivexml's
> >output would pass xmllint.
>
> As it happens, I opened a BZ on this just the other day. I think
> there's an additional element here: it seems that sometimes a
> registry key genuinely contains non-text data. An example is
> HKLM/SOFTWARE/Microsoft/MSDTC/Security/XAKey, which I'm guessing is
> a cryptographic key. This would require a CDATA section. However,
> it's not clear to me how the tool can reliably infer that a value is
> binary data without specific knowledge of the schema.

The type field stored in the registry is in many cases nonsensical.
In hivexml we trust the type, which is wrong.  We ought to either
shoot hivexml or fix it.

In hivexregedit / virt-win-reg, we dump all strings as binary
(ie. hex(TYPE):...) for this and for other reasons to do with
preserving the encoding.  It's explained in the man page I think.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://et.redhat.com/~rjones/virt-top
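To illustrate the hex(TYPE):... form mentioned above, here is a rough sketch of dumping a value as raw bytes instead of trusting its type. The function name format_hex_value is hypothetical, and the exact layout (type number in hex, lowercase comma-separated bytes, no line wrapping) is an assumption modelled on .reg files rather than taken from the hivexregedit source.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format len bytes as "hex(TYPE):xx,xx,..." where
 * TYPE is the registry type number printed in hex.
 * Returns a malloc'd string that the caller frees, or NULL on failure.
 */
char *
format_hex_value (unsigned type, const unsigned char *data, size_t len)
{
  /* "hex(" + up to 8 hex digits + "):" + 3 chars per byte + NUL. */
  size_t cap = 4 + 8 + 2 + 3 * len + 1;
  char *out = malloc (cap);
  size_t i;
  int n;

  if (out == NULL)
    return NULL;

  n = snprintf (out, cap, "hex(%x):", type);
  for (i = 0; i < len; i++) {
    /* Every byte after the first is preceded by a comma. */
    n += snprintf (out + n, cap - (size_t) n, "%s%02x",
                   i ? "," : "", data[i]);
  }
  return out;
}
```

Because the bytes are emitted untouched, trailing junk after a \0\0 or a mislabelled type cannot produce invalid text, at the cost of output that is harder to read.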
Richard W.M. Jones
2011-Mar-02 12:22 UTC
[Libguestfs] hivex: some issues (key encoding, ...) and suggested fixes
On Sat, Feb 26, 2011 at 08:56:48PM +0200, Török Edwin wrote:
> Hi,
>
> libhivex seems to do a great job at parsing hives most of the time, but
> there are some issues with a few registry keys.
>
> These can be worked around in the application that uses libhivex, but I
> think it'd be better if libhivex handled these itself.
>
> 1. UTF16 string in REG_SZ that has garbage after the \0\0
>
> There is code in hivex.c to handle this already but I think it has a typo:
>
>     /* Deal with the case where Windows has allocated a large buffer
>      * full of random junk, and only the first few bytes of the buffer
>      * contain a genuine UTF-16 string.
>      *
>      * In this case, iconv would try to process the junk bytes as UTF-16
>      * and inevitably find an illegal sequence (EILSEQ).  Instead, stop
>      * after we find the first \0\0.
>      *
>      * (Found by Hilko Bengen in a fresh Windows XP SOFTWARE hive).
>      */
>     size_t slen = utf16_string_len_in_bytes_max (data, len);
>     if (slen > len)
>       len = slen;
>
>     char *ret = windows_utf16_to_utf8 (data, len);
>
> slen is only used to increase length of data, but I think it should be
> decreasing it (to stop earlier).

Yes, it's strange -- this does appear to be a bug.

[...]
> 2. Non-ascii node names
>
> I found a node with a \xDC (?) in it:
> SOFTWARE\\ODBC\\ODBCINST.INI\\MS Code Page-\xDCbersetzer
>
> hivex.c has a comment like this:
>     /* AFAIK the node name is always plain ASCII, so no conversion
>      * to UTF-8 is necessary.  However we do need to nul-terminate
>      * the string.
>      */
>
> I think hivex should convert the node names from CP1252 (or is it
> ISO-8859-1?) to UTF-8.
>
> Workaround: I do the CP1252 -> UTF8 conversion myself for now

This patch was posted but I didn't apply it because it seems quite
risky:

https://www.redhat.com/archives/libguestfs/2010-July/msg00064.html

> 3. node_get_child is slow
>
> Documentation issue, it should say that using node_get_child is slow
> (because registry doesn't have an index, and you do a linear search).
>
> Workaround: I create a map of node names to children of a node, a lookup
> in that is faster than using node_get_child repeatedly

Agreed.

> 4. hivexml output is not a well-formed XML
>
> See problem #1 and #2, if value_string and node_name are fixed to not
> dump the binary garbage and just return UTF8 then I think hivexml's
> output would pass xmllint.

Shoot or fix.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
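The map workaround for issue 3 could be sketched as a sorted array searched with bsearch, built once per parent node and reused for every lookup. Everything here is hypothetical (struct child_entry, child_index_sort, child_index_lookup); the sketch assumes the caller has already collected each child's name and handle, that handles fit in a size_t, that names should compare case-insensitively, and that 0 can stand for "not found".

```c
#include <stdlib.h>
#include <strings.h>

/* One child of a node: its name and an opaque handle (assumed to fit
 * in a size_t for this sketch).
 */
struct child_entry {
  const char *name;
  size_t node;
};

/* Order entries by name, ignoring case. */
static int
compare_entries (const void *a, const void *b)
{
  const struct child_entry *ea = a;
  const struct child_entry *eb = b;

  return strcasecmp (ea->name, eb->name);
}

/* Sort the entries once, after collecting all children of a node. */
void
child_index_sort (struct child_entry *entries, size_t n)
{
  qsort (entries, n, sizeof entries[0], compare_entries);
}

/* Binary search by name.  Returns the child's handle, or 0 if there is
 * no child with that name.
 */
size_t
child_index_lookup (const struct child_entry *entries, size_t n,
                    const char *name)
{
  struct child_entry key = { name, 0 };
  const struct child_entry *hit =
    bsearch (&key, entries, n, sizeof entries[0], compare_entries);

  return hit ? hit->node : 0;
}
```

Building the index costs one pass over the children plus a sort, after which each lookup is logarithmic rather than linear, which pays off when many names are looked up under the same parent.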