Hey folks,
I'm trying to track down rogue QEMU segfaults in my infrastructure that
appear to be caused by gluster. The clues I have so far: the process is in
disk sleep when it dies, the process is backed only by gluster, and the
segfault points to I/O subsystem issues. Unfortunately I haven't figured out
how to get a full crash dump so I can run it through apport-retrace and see
exactly what went wrong. The other interesting thing is that this happens
only when gluster is under heavy load. Any tips on debugging further or
getting this fixed would be appreciated.
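What I plan to try next for capturing a core (assuming the libvirt qemu.conf
max_core knob and the Ubuntu 14.04 service name; corrections welcome):

```shell
# Check that cores are piped to apport (the Ubuntu default core_pattern
# starts with |/usr/share/apport/apport):
cat /proc/sys/kernel/core_pattern

# libvirt launches QEMU with a core size limit of 0 by default; allow full
# cores by setting max_core in /etc/libvirt/qemu.conf and restarting libvirt.
# (sed pattern is a sketch -- it uncomments/overwrites any existing max_core line)
sudo sed -i 's/^#\?max_core.*/max_core = "unlimited"/' /etc/libvirt/qemu.conf
sudo service libvirt-bin restart   # Ubuntu 14.04 service name

# Verify the limit took effect on a (re)started guest, PID from the segfault line:
grep 'Max core' /proc/27730/limits
```

The limit change only applies to QEMU processes started after the restart,
so I'd still need the crash to reproduce once more.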
Segfault:
Dec 30 20:42:56 HFMHVR3 kernel: [5976247.820875] qemu-system-x86[27730]:
segfault at 128 ip 00007f891f0cc82c sp 00007f89376846a0 error 4 in
qemu-system-x86_64 (deleted)[7f891ed42000+4af000]
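Meanwhile, here's the faulting offset I computed from the kernel line above,
in case it rings a bell for anyone (the addr2line path is an assumption for
the Ubuntu QEMU package, and needs the matching debug symbols installed):

```shell
# Values copied from the kernel segfault line above.
ip=0x7f891f0cc82c      # faulting instruction pointer
base=0x7f891ed42000    # load address of qemu-system-x86_64 from the same line
printf 'offset into binary: 0x%x\n' $(( ip - base ))
# prints: offset into binary: 0x38a82c

# With debug symbols for the exact same build, the offset should resolve to a
# source line, e.g.:
# addr2line -f -e /usr/bin/qemu-system-x86_64 0x38a82c
```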
Brick log:
[2014-12-30 20:42:56.797946] I [server.c:520:server_rpc_notify]
0-VMARRAY-server: disconnecting connectionfrom
HFMHVR3-27726-2014/11/29-00:42:11:436294-VMARRAY-client-0-0-0
[2014-12-30 20:42:56.798244] W [inodelk.c:392:pl_inodelk_log_cleanup]
0-VMARRAY-server: releasing lock on 6e640448-aa4c-4faa-b7ad-33e68aca0d3a held by
{client=0x7fe130776740, pid=0 lk-owner=ecb80
[2014-12-30 20:42:56.798287] I [server-helpers.c:289:do_fd_cleanup]
0-VMARRAY-server: fd cleanup on /HFMPCI0.img
[2014-12-30 20:42:56.798384] I [client_t.c:417:gf_client_unref]
0-VMARRAY-server: Shutting down connection
HFMHVR3-27726-2014/11/29-00:42:11:436294-VMARRAY-client-0-0-0
Nothing interesting appears in the VM log or around the segfault event in the
hypervisor log.
Environment
Ubuntu 14.04 running stock QEMU 2.0.0, modified only for gfapi support from
https://launchpad.net/~josh-boon/+archive/ubuntu/qemu-glusterfs, running on top
of an SSD RAID0 array. The gluster volumes are connected over back-to-back 10G
fiber links running in a bond using balance-rr.
Config
Filesystem mount
/dev/mapper/VG0-VAR on /var type xfs (rw,noatime,nodiratime,nobarrier)
Gluster config
Volume Name: VMARRAY
Type: Replicate
Volume ID: c0947aea-d07f-4ca0-bfcf-3b1c97cec247
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.9.1.1:/var/lib/glusterfs
Brick2: 10.9.1.2:/var/lib/glusterfs
Options Reconfigured:
cluster.choose-local: true
storage.owner-gid: 112
storage.owner-uid: 107
cluster.server-quorum-type: none
cluster.quorum-type: none
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
server.allow-insecure: on
network.ping-timeout: 7
Machine Disk XML
<disk type='network' device='disk'>
<driver name='qemu' type='raw' cache='none'/>
<source protocol='gluster' name='VMARRAY/HFMPCI0.img'>
<host name='10.9.1.2'/>
</source>
<target dev='vda' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x00'
slot='0x05' function='0x0'/>
</disk>
Thanks,
Josh