On 5/15/23 19:51, Richard W.M. Jones wrote:
> On Mon, May 15, 2023 at 07:22:28PM +0200, Laszlo Ersek wrote:
>> Hi Rich,
>>
>> do we expect "make check-valgrind" to succeed in virt-v2v at the moment
>> (at commit e83de8abe6c5)? I see there's a "valgrind-suppressions" file
>> in the project root, but "make check-valgrind" still fails for me, with
>> numerous errors.
>
> Yes I expect it should work. Or at least it currently passes for me.
>
>> I'm attaching the test suite log. (Compare "--error-exitcode=119" in
>> "m4/guestfs-progs.m4".)
>
> The problem is because you don't have sufficient debuginfo installed.
> Your stack traces are full of "???" (missing symbols), and that stops
> the suppressions from being effective.
>
> Now as for _what_ debuginfo you're missing, that's a bit trickier to
> tell. The suppressions with type "Memcheck:Cond" seem to be in
> libnuma and glibc, so I'd start by making sure you have full debuginfo
> for those, and that might help.
>
> It may be that you've found a completely new problem (requiring a new
> suppression), or that you're using some very old version of glibc/etc
> which we never added suppressions for. The only way to fix those is
> to investigate the valgrind message and try to see what it's
> complaining about. But I would concentrate on trying to correct the
> unresolved symbols first.

Something is not adding up.

* I've run "ldd" on my locally built virt-v2v binary, to learn what
  shared libraries it uses. Then I located all the packages (installed
  RPMs) providing those libraries (symlinks in fact), using "rpm -qf".
  Then I installed the debuginfo packages for each of those RPMs.

  I *still* get stack dumps like the following (taken from
  "tests/test-v2v-fedora-luks-on-lvm-conversion.sh.log"):

==34448== Conditional jump or move depends on uninitialised value(s)
==34448==    at 0x40191DD: __GI___tunables_init (dl-tunables.c:211)
==34448==    by 0x4020056: _dl_sysdep_start (dl-sysdep.c:110)
==34448==    by 0x4021A07: _dl_start (rtld.c:502)
==34448==    by 0x4020AD7: ??? (in /usr/lib64/ld-linux-x86-64.so.2)
==34448==    by 0xE: ???
==34448==    by 0x1FFEFFE352: ???
==34448==    by 0x1FFEFFE35B: ???
==34448==    by 0x1FFEFFE366: ???
==34448==    by 0x1FFEFFE369: ???
==34448==    by 0x1FFEFFE36E: ???
==34448==    by 0x1FFEFFE39F: ???
==34448==    by 0x1FFEFFE3A2: ???
==34448==    by 0x1FFEFFE3A7: ???
==34448==    by 0x1FFEFFE3AD: ???
==34448==    by 0x1FFEFFE3CA: ???
==34448==    by 0x1FFEFFE3D0: ???
==34448==    by 0x1FFEFFE3EB: ???
==34448==    by 0x1FFEFFE3F1: ???
==34448==    by 0x1FFEFFE40C: ???
==34448==    by 0x1FFEFFE412: ???

  Note the address 0x4020AD7. Valgrind itself says that the address is
  somewhere inside "/usr/lib64/ld-linux-x86-64.so.2". Problem is, I *do*
  have the debuginfo package installed (with correct version) for that
  binary. The binary comes from "glibc-2.34-40.el9_1.1.x86_64", and I've
  got the matching "glibc-debuginfo-2.34-40.el9_1.1.x86_64" package
  installed.

* Now, from that kind of (useless) backtrace, I have four instances in
  this test case log, in total.
  However, there's a different kind too (just one instance):

==34448== Conditional jump or move depends on uninitialised value(s)
==34448==    at 0x484A608: strlen (vg_replace_strmem.c:495)
==34448==    by 0x5443D32: strdup (strdup.c:41)
==34448==    by 0x4F09819: guestfs_int_copy_string_list (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448==    by 0x4F091DD: guestfs_int_copy_environ (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448==    by 0x4EB6B67: run_command (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448==    by 0x4EB778D: guestfs_int_cmd_run (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448==    by 0x4EC7B10: qemu_img_supports_U_option (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448==    by 0x4EC775A: get_json_output (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448==    by 0x4EC745D: guestfs_impl_disk_format (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448==    by 0x4E8769C: guestfs_disk_format (in /home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448==    by 0x3B2A67: guestfs_int_ocaml_disk_format (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448==    by 0x31B9D6: camlGuestfs__fun_12954 (guestfs.ml:1186)
==34448==    by 0x334370: camlStdlib__list__map_233 (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448==    by 0x2AE27A: camlInput_disk__detect_local_input_format_217 (input_disk.ml:142)
==34448==    by 0x2ADE82: camlInput_disk__setup_216 (input_disk.ml:88)
==34448==    by 0x28E671: camlV2v__main_202 (v2v.ml:552)
==34448==    by 0x2DD3C1: camlTools_utils__run_main_and_handle_errors_510 (tools_utils.ml:228)
==34448==    by 0x290D07: camlV2v__entry (v2v.ml:700)
==34448==    by 0x27FB28: caml_program (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448==    by 0x41AD53: caml_start_program (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448==    by 0x41B166: caml_startup_common (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448==    by 0x41B1AC: caml_startup (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448==    by 0x27F16F: main (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)

  Here all addresses seem to be resolved, even those that point into my
  locally built libguestfs. What I don't understand however are the
  topmost two frames. I *think* those come from valgrind itself! So is
  valgrind complaining about... valgrind???

  "vg_replace_strmem.c" is definitely a valgrind source file. I've
  cloned the upstream git repo and checked -- it is
  "shared/vg_replace_strmem.c", and that file has existed since November
  2013. Yet, when I install valgrind-debugsource and valgrind-debuginfo
  (matching the installed valgrind version --
  "valgrind-3.19.0-3.el9.x86_64"), *none* of the files in those packages
  are "vg_replace_strmem.c".

  After downloading the SRPM from Brew and build-prepping it, I find, in
  "shared/vg_replace_strmem.c":

476 /*---------------------- strlen ----------------------*/
477
478 // Note that this replacement often doesn't get used because gcc inlines
479 // calls to strlen() with its own built-in version. This can be very
480 // confusing if you aren't expecting it. Other small functions in
481 // this file may also be inline by gcc.
482
483 #define STRLEN(soname, fnname) \
484    SizeT VG_REPLACE_FUNCTION_EZU(20070,soname,fnname) \
485       ( const char* str ); \
486    SizeT VG_REPLACE_FUNCTION_EZU(20070,soname,fnname) \
487       ( const char* str ) \
488    { \
489       SizeT i = 0; \
490       while (str[i] != 0) i++; \
491       return i; \
492    }
493
494 #if defined(VGO_linux)
495  STRLEN(VG_Z_LIBC_SONAME, strlen)

  So basically valgrind tries to preempt the strlen() symbol from glibc
  with its own implementation.

  Then, "strdup.c" is not a valgrind source file, but I found it in the
  glibc debug packages --
  "/usr/src/debug/glibc-2.34-40.el9_1.1.x86_64/string/strdup.c". (How
  *incredibly* useful of valgrind *not* to print the *full* pathname of
  a source file.)
  It goes like this:

37 /* Duplicate S, returning an identical malloc'd string. */
38 char *
39 __strdup (const char *s)
40 {
41   size_t len = strlen (s) + 1;
42   void *new = malloc (len);
43
44   if (new == NULL)
45     return NULL;
46
47   return (char *) memcpy (new, s, len);
48 }

  So guestfs_int_copy_string_list() calls strdup() calls strlen(), with
  strdup coming from glibc and strlen coming from valgrind itself. And
  then valgrind complains about its own strlen implementation (fun!),
  which is BTW an incorrect complaint, because the *C-language* code at
  lines 488-492 is proper.

This whole thing looks completely busted. I'll try to fool around with
glibc tunables.

Laszlo
On 5/16/23 14:37, Laszlo Ersek wrote:
> This whole thing looks completely busted. I'll try to fool around with
> glibc tunables.

"export GLIBC_TUNABLES=glibc.cpu.hwcap_mask=0" makes no difference for
the valgrind results.

Laszlo
One mystery resolved:

On 5/16/23 14:37, Laszlo Ersek wrote:
> I *still* get stack dumps like the following (taken from
> "tests/test-v2v-fedora-luks-on-lvm-conversion.sh.log"):
>
> ==34448== Conditional jump or move depends on uninitialised value(s)
> ==34448==    at 0x40191DD: __GI___tunables_init (dl-tunables.c:211)
> ==34448==    by 0x4020056: _dl_sysdep_start (dl-sysdep.c:110)
> ==34448==    by 0x4021A07: _dl_start (rtld.c:502)
> ==34448==    by 0x4020AD7: ??? (in /usr/lib64/ld-linux-x86-64.so.2)

how very fittingly at that:

https://sourceware.org/bugzilla/show_bug.cgi?id=28256

Rich, you had reported *this very bug* one and a half years ago, for
upstream glibc. "Small world" and all that.

They fixed it for 2.35. Apparently, the fix has not been backported to
RHEL-9 to this day. My version (in RHEL-9.1) is
"glibc-2.34-40.el9_1.1.x86_64", but looking at the latest RPM in Brew
(glibc-2.34-68.el9, for RHEL-9.3), that one is still not 2.35-based, and
the %changelog does not mention the patch from
<https://sourceware.org/bugzilla/show_bug.cgi?id=28256#c2>.

Laszlo
On Tue, May 16, 2023 at 02:37:00PM +0200, Laszlo Ersek wrote:
> Something is not adding up.
>
> * I've run "ldd" on my locally built virt-v2v binary, to learn what
>   shared libraries it uses. Then I located all the packages (installed
>   RPMs) providing those libraries (symlinks in fact), using "rpm -qf".
>   Then I installed the debuginfo packages for each of those RPMs.

I've just tried it on RHEL 9 with upstream virt-v2v + commit
c0bb624a151b. I'm seeing some failures but they look quite different
to yours and all seem to be caused by a single leak in libvirt or how
we use libvirt (at least potentially, I've not investigated, and I
don't see this happening in Fedora).

I have:

glibc-2.34-67.el9.x86_64
glibc-debuginfo-2.34-67.el9.x86_64
glibc-debugsource-2.34-67.el9.x86_64
valgrind-3.19.0-3.el9.x86_64
valgrind-devel-3.19.0-3.el9.x86_64
libvirt-9.3.0-1.el9.x86_64
libvirt-debuginfo-9.3.0-1.el9.x86_64
libvirt-debugsource-9.3.0-1.el9.x86_64

How many of the tests fail for you? Just a small number or all of
them? If it's a small number, which ones?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-suite.log.xz
Type: application/x-xz
Size: 7392 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20230516/7a74de0b/attachment-0001.xz>