Maxim Sobolev
2016-Jul-05 05:26 UTC
A faulty program corrupts some its data preventing correct core generation (Failed to write core file for process postgres (error 14))
Hi all, I am investigating some random postgresql-9.1.21 server crashes on FreeBSD 10.3. We've started seeing those after upgrading from postgres 9.1.18 on more than one system, so hardware problems (e.g. RAM issues) are very unlikely. I suspect that postgres is at fault, however I am also curious how it could be that the kernel is not capable of generating a core file when an application does something silly. Is it that some ELF-related data structures got corrupted, or something else? Are we protecting the page where the ELF header is mapped with the R/O flag? I am looking at possibly recreating this by poking around the ELF header(s) and seeing if I can reliably corrupt it in a similar manner. Any pointers or suggestions are appreciated.

Jun 27 04:10:18 dal12 kernel: Failed to write core file for process postgres (error 14)
Jun 27 04:10:18 dal12 kernel: pid 41361 (postgres), uid 70: exited on signal 11
Jul 1 05:21:46 dal12 kernel: Failed to write core file for process postgres (error 14)
Jul 1 05:21:46 dal12 kernel: pid 1722 (postgres), uid 70: exited on signal 11

#define EFAULT 14 /* Bad address */

The resulting files are truncated and are not really usable for anything. We've seen the same issue

-rw------- 1 pgsql wheel 1310720 Jun 27 04:10 postgres.41361.core
-rw------- 1 pgsql wheel 1310720 Jul 1 05:21 postgres.1722.core

[ssp-root at dal12 /var/tmp]$ sudo gdb711 postgres postgres.1722.core
GNU gdb (GDB) 7.11 [GDB v7.11 for FreeBSD]
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd10.3".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from postgres...(no debugging symbols found)...done.
BFD: Warning: /var/tmp/postgres.1722.core is truncated: expected core file size >= 517120000, found: 1310720.
[New LWP 100261]
Core was generated by `postgres'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000800cfba67 in ?? () from /lib/libthr.so.3
(gdb) where
#0 0x0000000800cfba67 in ?? () from /lib/libthr.so.3
Backtrace stopped: Cannot access memory at address 0x7fffffffdd08
(gdb) q

-Max
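For anyone who wants to check a core file for truncation without starting gdb, the comparison reported in the BFD warning above can be approximated by hand. The following is only a minimal sketch, not BFD's actual code: it assumes a little-endian ELF64 core (as on amd64), assumes the expected size is the largest p_offset + p_filesz over all program headers, and does not handle the extended program header count (e_phnum == 0xffff). The function names expected_core_size and check_core are made up for illustration.

```python
import os
import struct

# ELF64 little-endian file header (64 bytes) and program header (56 bytes).
ELF64_EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
ELF64_PHDR = struct.Struct("<IIQQQQQQ")


def expected_core_size(path):
    """Return the largest p_offset + p_filesz over all program headers."""
    with open(path, "rb") as f:
        raw = f.read(ELF64_EHDR.size)
        if len(raw) < ELF64_EHDR.size:
            raise ValueError("file too short for an ELF64 header")
        fields = ELF64_EHDR.unpack(raw)
        ident = fields[0]
        # Magic 0x7f 'E' 'L' 'F', EI_CLASS == 2 (64-bit), EI_DATA == 1 (LSB).
        if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
            raise ValueError("not a little-endian ELF64 file")
        e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
        end = 0
        for i in range(e_phnum):
            f.seek(e_phoff + i * e_phentsize)
            raw = f.read(ELF64_PHDR.size)
            if len(raw) < ELF64_PHDR.size:
                raise ValueError("program header table itself is truncated")
            # Only the file offset and file size of each segment matter here.
            _, _, p_offset, _, _, p_filesz, _, _ = ELF64_PHDR.unpack(raw)
            end = max(end, p_offset + p_filesz)
        return end


def check_core(path):
    """Return (expected_size, actual_size, truncated)."""
    expected = expected_core_size(path)
    actual = os.path.getsize(path)
    return expected, actual, actual < expected
```

Calling check_core("/var/tmp/postgres.1722.core") on the file above should flag it as truncated if the assumptions hold. Because the program header table sits at the start of the file, it survives even when the segment data was never written, and the first segment whose end offset lies beyond the actual file size hints at which region the kernel failed to copy out.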
Konstantin Belousov
2016-Jul-05 11:48 UTC
A faulty program corrupts some its data preventing correct core generation (Failed to write core file for process postgres (error 14))
On Mon, Jul 04, 2016 at 10:26:25PM -0700, Maxim Sobolev wrote:
> Hi all, investigating some random postgresql-9.1.21 server crashes on
> FreeBSD 10.3, we've started seeing those after upgrading from postgres
> 9.1.18 on more than one system, so hardware (e.g. RAM issues) are very
> unlikely. I suspect that postgres is at fault, however I am also curious
> how could it be that kernel is not capable of generating core file when
> application does something silly? Is it that some ELF-related data
> structures got corrupted or something else? Are we protecting the page
> where ELF header is mapped with R/O flag? I am looking at possibly
> recreating this by poking around elf header(s), seeing if I can corrupt it
> in a similar manner reliably, any pointers or suggestions are appreciated.
>
> Jun 27 04:10:18 dal12 kernel: Failed to write core file for process
> postgres (error 14)
> Jun 27 04:10:18 dal12 kernel: pid 41361 (postgres), uid 70: exited on
> signal 11
> Jul 1 05:21:46 dal12 kernel: Failed to write core file for process
> postgres (error 14)
> Jul 1 05:21:46 dal12 kernel: pid 1722 (postgres), uid 70: exited on signal
> 11
>
> #define EFAULT 14 /* Bad address */
>
> The resulting files are truncated and is not really usable for anything.
> We've seen the same issue
>
> -rw------- 1 pgsql wheel 1310720 Jun 27 04:10 postgres.41361.core
> -rw------- 1 pgsql wheel 1310720 Jul 1 05:21 postgres.1722.core
>
> [ssp-root at dal12 /var/tmp]$ sudo gdb711 postgres postgres.1722.core
> GNU gdb (GDB) 7.11 [GDB v7.11 for FreeBSD]
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-portbld-freebsd10.3".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from postgres...(no debugging symbols found)...done.
> BFD: Warning: /var/tmp/postgres.1722.core is truncated: expected core file
> size >= 517120000, found: 1310720.
> [New LWP 100261]
> Core was generated by `postgres'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x0000000800cfba67 in ?? () from /lib/libthr.so.3
> (gdb) where
> #0 0x0000000800cfba67 in ?? () from /lib/libthr.so.3
> Backtrace stopped: Cannot access memory at address 0x7fffffffdd08
> (gdb) q
>

https://lists.freebsd.org/pipermail/freebsd-stable/2016-June/084877.html