On Tue, May 16, 2023 at 02:37:00PM +0200, Laszlo Ersek wrote:> Something is not adding up. > > * I've run "ldd" on my locally built virt-v2v binary, to learn what shared libraries it uses. Then I located all the packages (installed RPMs) providing those libraries (symlinks in fact), using "rpm -qf". Then I installed the debuginfo packages for each of those RPMs.I've just tried it on RHEL 9 with upstream virt-v2v + commit c0bb624a151b. I'm seeing some failures but they look quite different to yours and all seem to be caused by a single leak in libvirt or how we use libvirt (at least potentially, I've not investigated, and I don't see this happening in Fedora). I have: glibc-2.34-67.el9.x86_64 glibc-debuginfo-2.34-67.el9.x86_64 glibc-debugsource-2.34-67.el9.x86_64 valgrind-3.19.0-3.el9.x86_64 valgrind-devel-3.19.0-3.el9.x86_64 libvirt-9.3.0-1.el9.x86_64 libvirt-debuginfo-9.3.0-1.el9.x86_64 libvirt-debugsource-9.3.0-1.el9.x86_64 How many of the tests fail for you? Just a small number or all of them? If it's a small number, which ones? Rich.> I *still* get stack dumps like the following (taken from "tests/test-v2v-fedora-luks-on-lvm-conversion.sh.log"): > > ==34448== Conditional jump or move depends on uninitialised value(s) > ==34448== at 0x40191DD: __GI___tunables_init (dl-tunables.c:211) > ==34448== by 0x4020056: _dl_sysdep_start (dl-sysdep.c:110) > ==34448== by 0x4021A07: _dl_start (rtld.c:502) > ==34448== by 0x4020AD7: ??? (in /usr/lib64/ld-linux-x86-64.so.2) > ==34448== by 0xE: ??? > ==34448== by 0x1FFEFFE352: ??? > ==34448== by 0x1FFEFFE35B: ??? > ==34448== by 0x1FFEFFE366: ??? > ==34448== by 0x1FFEFFE369: ??? > ==34448== by 0x1FFEFFE36E: ??? > ==34448== by 0x1FFEFFE39F: ??? > ==34448== by 0x1FFEFFE3A2: ??? > ==34448== by 0x1FFEFFE3A7: ??? > ==34448== by 0x1FFEFFE3AD: ??? > ==34448== by 0x1FFEFFE3CA: ??? > ==34448== by 0x1FFEFFE3D0: ??? > ==34448== by 0x1FFEFFE3EB: ??? > ==34448== by 0x1FFEFFE3F1: ??? > ==34448== by 0x1FFEFFE40C: ??? > ==34448== by 0x1FFEFFE412: ??? > > Note the address 0x4020AD7. Valgrind itself says that the address is somewhere inside "/usr/lib64/ld-linux-x86-64.so.2". Problem is, I *do* have the debuginfo package installed (with correct version) for that binary. The binary comes from "glibc-2.34-40.el9_1.1.x86_64", and I've got the matching "glibc-debuginfo-2.34-40.el9_1.1.x86_64" package installed. > > * Now, from that kind of (useless) backtrace, I have four instances in this test case log, in total. However, there's a different kind too (just one instance): > > ==34448== Conditional jump or move depends on uninitialised value(s) > ==34448== at 0x484A608: strlen (vg_replace_strmem.c:495) > ==34448== by 0x5443D32: strdup (strdup.c:41) > ==34448== by 0x4F09819: guestfs_int_copy_string_list (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0) > ==34448== by 0x4F091DD: guestfs_int_copy_environ (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0) > ==34448== by 0x4EB6B67: run_command (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0) > ==34448== by 0x4EB778D: guestfs_int_cmd_run (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0) > ==34448== by 0x4EC7B10: qemu_img_supports_U_option (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0) > ==34448== by 0x4EC775A: get_json_output (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0) > ==34448== by 0x4EC745D: guestfs_impl_disk_format (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0) > ==34448== by 0x4E8769C: guestfs_disk_format (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0) > ==34448== by 0x3B2A67: guestfs_int_ocaml_disk_format (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v) > ==34448== by 0x31B9D6: camlGuestfs__fun_12954 (guestfs.ml:1186) > ==34448== by 0x334370: camlStdlib__list__map_233 (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v) > ==34448== by 0x2AE27A: camlInput_disk__detect_local_input_format_217 (input_disk.ml:142) > ==34448== by 0x2ADE82: camlInput_disk__setup_216 (input_disk.ml:88) > ==34448== by 0x28E671: camlV2v__main_202 (v2v.ml:552) > ==34448== by 0x2DD3C1: camlTools_utils__run_main_and_handle_errors_510 (tools_utils.ml:228) > ==34448== by 0x290D07: camlV2v__entry (v2v.ml:700) > ==34448== by 0x27FB28: caml_program (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v) > ==34448== by 0x41AD53: caml_start_program (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v) > ==34448== by 0x41B166: caml_startup_common (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v) > ==34448== by 0x41B1AC: caml_startup (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v) > ==34448== by 0x27F16F: main (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v) > > Here all addresses seem to be resolved, even those that point into my locally built libguestfs. What I don't understand however are the topmost two frames. I *think* those come from valgrind itself! So is valgrind complaining about... valgrind??? > > "vg_replace_strmem.c" is definitely a valgrind source file. I've cloned the upstream git repo and checked -- it is "shared/vg_replace_strmem.c", and that file has existed since November 2013. Yet, when I install valgrind-debugsource and valgrind-debuginfo (matching the installed valgrind version -- "valgrind-3.19.0-3.el9.x86_64"), *none* of the files in those packages are "vg_replace_strmem.c". > > After downloading the SRPM from Brew and build-prepping it, I find, in "shared/vg_replace_strmem.c": > > 476 /*---------------------- strlen ----------------------*/ > 477 > 478 // Note that this replacement often doesn't get used because gcc inlines > 479 // calls to strlen() with its own built-in version. This can be very > 480 // confusing if you aren't expecting it. Other small functions in > 481 // this file may also be inline by gcc. > 482 > 483 #define STRLEN(soname, fnname) \ > 484 SizeT VG_REPLACE_FUNCTION_EZU(20070,soname,fnname) \ > 485 ( const char* str ); \ > 486 SizeT VG_REPLACE_FUNCTION_EZU(20070,soname,fnname) \ > 487 ( const char* str ) \ > 488 { \ > 489 SizeT i = 0; \ > 490 while (str[i] != 0) i++; \ > 491 return i; \ > 492 } > 493 > 494 #if defined(VGO_linux) > 495 STRLEN(VG_Z_LIBC_SONAME, strlen) > > So basically valgrind tries to preempt the strlen() symbol from glibc with its own implementation. > > Then, "strdup.c" is not a valgrind source file, but I found it from the glibc debug packages -- "/usr/src/debug/glibc-2.34-40.el9_1.1.x86_64/string/strdup.c". (How *incredibly* useful of valgrind *not* to print the *full* pathname of a source file.) It goes like this: > > 37 /* Duplicate S, returning an identical malloc'd string. */ > 38 char * > 39 __strdup (const char *s) > 40 { > 41 size_t len = strlen (s) + 1; > 42 void *new = malloc (len); > 43 > 44 if (new == NULL) > 45 return NULL; > 46 > 47 return (char *) memcpy (new, s, len); > 48 } > > So guestfs_int_copy_string_list() calls strdup() calls strlen(), with strdup coming from glibc and strlen coming from valgrind itself. And then valgrind complains about its own strlen implementation (fun!), which is BTW an incorrect complaint, because the *C-language* code at lines 488-492 is proper. > > This whole thing looks completely busted. I'll try to fool around with glibc tunables. > > Laszlo-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top -------------- next part -------------- A non-text attachment was scrubbed... Name: test-suite.log.xz Type: application/x-xz Size: 7392 bytes Desc: not available URL: <http://listman.redhat.com/archives/libguestfs/attachments/20230516/7a74de0b/attachment-0001.xz>
On 5/16/23 15:26, Richard W.M. Jones wrote:> How many of the tests fail for you? Just a small number or all of > them?Almost all of them fail. I think I've figured out why. First, as I mention up-thread, there's upstream glibc bug <https://sourceware.org/bugzilla/show_bug.cgi?id=28256>, reported by you, fixed in 2.35; but the upstream fix has never been backported to RHEL-9. However, there's another piece to the puzzle. In the upstream glibc bug report, you wrote: "I'm getting this when I run any program under valgrind with glibc tunables" Keyword being "glibc tunables". I don't set them myself -- so why does my (unfixed) glibc nonetheless trigger the valgrind false positive? The answer to that is the following: I build all libguestfs projects from source. I keep referring to the following dependency graph (constructed earlier with your help): libvirt-ocaml --------- \ libnbd <--> nbdkit--- \ \ \ hivex ----> libguestfs --------------------> virt-v2v -> virt-p2v / \ / supermin -- -> guestfs-tools - (It does not show augeas, which I cannot be bothered to build locally.) Accordingly, I have a good number of shell scripts that are called: r-guestfs-tools r-libguestfs r-nbdkit r-virt-p2v r-virt-v2v For example, "r-guestfs-tools" looks like this:> #!/bin/bash > > # enable dependencies needed by guestfs-tools > SUPERMIN=$HOME/src/v2v/supermin/src/supermin \ > $HOME/src/v2v/hivex/run \ > $HOME/src/v2v/libguestfs/run \ > "$@"and, for example, "r-virt-v2v" is:> #!/bin/bash > > # enable dependencies needed by virt-v2v > SUPERMIN=$HOME/src/v2v/supermin/src/supermin \ > $HOME/src/v2v/hivex/run \ > $HOME/src/v2v/libguestfs/run \ > $HOME/src/v2v/libnbd/run \ > $HOME/src/v2v/libvirt-ocaml/run \ > $HOME/src/v2v/guestfs-tools/run \ > "$@"And when I build virt-v2v locally, I do: r-virt-v2v autoreconf -i r-virt-v2v ./configure CFLAGS=-fPIC --enable-werror=yes --prefix=/usr r-virt-v2v make -j6 r-virt-v2v make -j6 check r-virt-v2v make -j6 check-valgrind In effect this chains the "run" scripts from all the other local build trees that virt-v2v depends upon, for building. Note that virt-v2v's own run script is not chained by my own "r-virt-v2v" script -- that run script is only needed if someone wants to run (i.e., not build) virt-v2v locally. (In fact, when I'm gearing up to autoreconf & configure the virt-v2v tree, the "run" script doesn't even exist in that tree (only configure will generate it!), so I couldn't even run it from "r-virt-v2v"!) Therefore, whenever I also run virt-v2v locally, I spell out the local, now-existent, "./run" in addition, from the virt-v2v project root: r-virt-v2v ./run virt-v2v ... In effect this chains virt-v2v's own run script in addition to the run scripts of its dependencies. Now here's the problem. Consider virt-v2v's own "run.in" script:> # This is a cheap way to find some use-after-free and uninitialized > # read problems when using glibc. But if we are valgrinding then > # don't use this because it can stop valgrind from working. > if [ -z "$VG" ]; then > random_val="$(@AWK@ 'BEGIN{srand(); print 1+int(255*rand())}' < /dev/null)" > LD_PRELOAD="${LD_PRELOAD:+"$LD_PRELOAD:"}libc_malloc_debug.so.0" > GLIBC_TUNABLES=glibc.malloc.check=1:glibc.malloc.perturb=$random_val > export LD_PRELOAD GLIBC_TUNABLES > fiOuch. GLIBC_TUNABLES are known to break valgrind. Splendid. It turns out however that virt-v2v's own "run" script does not participate in "make check-valgrind". I verified that by adding an "else" branch above, printing an error message, and exiting with status 1. It does not fire. So this is all fine: the above safety check is for running virt-v2v *manually* under valgrind. "make check-valgrind" sets VG, but it does not call virt-v2v's own "run", so the VG nullity check is not even necessary to reach in the "run" script, for "make check-valgrind". And the check handles manual VG settings properly. But... remember "r-virt-v2v" again:> #!/bin/bash > > # enable dependencies needed by virt-v2v > SUPERMIN=$HOME/src/v2v/supermin/src/supermin \ > $HOME/src/v2v/hivex/run \ > $HOME/src/v2v/libguestfs/run \ > $HOME/src/v2v/libnbd/run \ > $HOME/src/v2v/libvirt-ocaml/run \ > $HOME/src/v2v/guestfs-tools/run \ > "$@"It turns out that "guestfs-tools/run" has the exact same logic for setting GLIBC_TUNABLES! So when I execute r-virt-v2v make -j6 check-valgrind then the environment for "make -j6 check-valgrind" will *inherit* a GLIBC_TUNABLES variable, from (at least!) "guestfs-tools/run". The VG variable will only be set internally to "make check-valgrind", which is too late; it does not prevent "guestfs-tools" from setting GLIBC_TUNABLES. I've verified this in the output of r-virt-v2v env which does show GLIBC_TUNABLES. And that way I hit glibc bug <https://sourceware.org/bugzilla/show_bug.cgi?id=28256>. Now here's another interesting difference: - the "run" script in guestfs-tools, virt-p2v, and virt-v2v (1) don't touch GLIBC_TUNABLES when valgrinding, and (2) set GLIBC_TUNABLES when not valgrinding, - whereas the "run" script in libnbd (which I also chain in "r-virt-v2v", at an earlier stage, see above) (1) *unsets* GLIBC_TUNABLES when valgrinding, and (2) doesn't touch GLIBC_TUNABLES when not valgrinding. See libnbd commit 2eeb0c693ce1 ("tests: Remove GLIBC_TUNABLES when running under valgrind", 2021-08-26) -- it even references the same glibc bug. Either way, if I use "env -u" to unset LD_PRELOAD and GLIBC_TUNABLES between the "chain of run scripts" and "make check-valgrind", as in: r-virt-v2v \ env -u LD_PRELOAD -u GLIBC_TUNABLES \ make -j6 check-valgrind TESTS=test-v2v-fedora-luks-on-lvm-conversion.sh then the test (test-v2v-fedora-luks-on-lvm-conversion.sh) passes; valgrind doesn't complain. Therefore, IMO, this is a bug in how the "run" scripts compose. Arguably, it should be possible to chain any number of those run scripts (it's a valid use case for a user to depend on all of the build trees at the same time), and they should all agree about GLIBC_TUNABLES. Namely, the run scripts should neither set, nor unset, GLIBC_TUNABLES and LD_PRELOAD; those variables should be ignored altogether in the "run" scripts. Should I submit patches to remove the LD_PRELOAD and GLIBC_TUNABLES tweaking from all the run scripts? Thanks! Laszlo