Hello,

I'm trying to set up hastd on two servers and got an error which I can't understand. One box is running as primary; then I reboot it, and the other box takes the primary role via CARP events. At boot, the first box then tries to set the primary role on its own HAST instance and fails with this:

Jan 18 22:13:03 gw_chlb_2 hastd[1387]: [storage0] (primary) G_GATE_CMD_DONE failed: No such file or directory.
Jan 18 22:13:08 gw_chlb_2 hastd[1004]: [storage0] (primary) Worker process exited ungracefully (pid=1387, exitcode=71).

I thought the geom_gate module might be the problem, so I compiled it into the kernel. As you can see, that doesn't help. Both servers run FreeBSD 9.0-STABLE, updated one week ago. hastd uses the whole disk. More info from hastd:

gw_chlb_2# hastd -dF -c /etc/hast.conf
[INFO] Started successfully, running protocol version 1.
[DEBUG][1] Listening on control address /var/run/hastctl.
[INFO] Listening on address 192.168.0.1:8457.
[INFO] [storage0] (init) Role changed to primary.
[DEBUG][1] [storage0] (primary) Obtained info about /dev/ada2.
[DEBUG][1] [storage0] (primary) Locked /dev/ada2.
[INFO] [storage0] (primary) Device hast/storage0 created.
[DEBUG][1] [storage0] (primary) Privileges successfully dropped using jail+setgid+setuid.
[INFO] [storage0] (primary) Privileges successfully dropped.
[INFO] [storage0] (primary) Connected to tcp4://192.168.0.2.
[INFO] [storage0] (primary) Synchronization started. 6.0MB to go.
[ERROR] [storage0] (primary) G_GATE_CMD_DONE failed: No such file or directory.
[INFO] [storage0] (primary) Received cancel from the kernel, exiting.
[DEBUG][1] Unable to receive event header: Socket is not connected.
[ERROR] [storage0] (primary) Worker process exited ungracefully (pid=1452, exitcode=71).
[INFO] [storage0] (primary) Changing resource role back to init.

Any thoughts?

---
With Best Regards / Ystävällisin terveisin
Artem Kajalainen
Hi,

On Wed, 18 Jan 2012 20:23:25 +0200 Artem Kajalainen wrote:

AK> Hello,
AK> I'm trying to setup hastd on two servers and got error, which I can't
AK> understand. Box is running as primary, then i reboot it, another box
AK> get primary role by carp events, then 1st box at boot tries to set up
AK> primary role on own hast instance and fails with this:
AK> Jan 18 22:13:03 gw_chlb_2 hastd[1387]: [storage0] (primary)
AK> G_GATE_CMD_DONE failed: No such file or directory.
AK> Jan 18 22:13:08 gw_chlb_2 hastd[1004]: [storage0] (primary) Worker
AK> process exited ungracefully (pid=1387, exitcode=71).
AK> I thought that geom_gate module can be problem, so i compiled it in
AK> kernel. As you can see - it doesn't help. Both servers are
AK> FreeBSD9.0-stable, updated 1 week ago. Hastd use whole disk. More info
AK> from hastd:
AK> gw_chlb_2# hastd -dF -c /etc/hast.conf
AK> [INFO] Started successfully, running protocol version 1.
AK> [DEBUG][1] Listening on control address /var/run/hastctl.
AK> [INFO] Listening on address 192.168.0.1:8457.
AK> [INFO] [storage0] (init) Role changed to primary.
AK> [DEBUG][1] [storage0] (primary) Obtained info about /dev/ada2.
AK> [DEBUG][1] [storage0] (primary) Locked /dev/ada2.
AK> [INFO] [storage0] (primary) Device hast/storage0 created.
AK> [DEBUG][1] [storage0] (primary) Privileges successfully dropped using
AK> jail+setgid+setuid.
AK> [INFO] [storage0] (primary) Privileges successfully dropped.
AK> [INFO] [storage0] (primary) Connected to tcp4://192.168.0.2.
AK> [INFO] [storage0] (primary) Synchronization started. 6.0MB to go.
AK> [ERROR] [storage0] (primary) G_GATE_CMD_DONE failed: No such file or directory.
AK> [INFO] [storage0] (primary) Received cancel from the kernel, exiting.
AK> [DEBUG][1] Unable to receive event header: Socket is not connected.
AK> [ERROR] [storage0] (primary) Worker process exited ungracefully
AK> (pid=1452, exitcode=71).
AK> [INFO] [storage0] (primary) Changing resource role back to init.
AK> Any thoughts?
Sorry, Artem, I read your email only today.

After investigating, it looks like since r226859, when 'async' mode was added, we have two issues with synchronization from secondary to primary (normally a very rare case):

1) When synchronization from secondary to primary is running and the primary gets a READ request, the request should be sent to the secondary, but it is actually lost. As a result the READ operation gets stuck. After the synchronization is complete, subsequent READ requests, which can now be served by the primary, work OK.

2) In async mode, for synchronization requests, the write_complete() function, which sends the G_GATE_CMD_DONE command to ggate, is called twice and the second call fails.

Artem, did you run async mode? If you did, then I suppose you observed the second issue. Could you please try the attached patch?

-- 
Mikolaj Golub

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hastd.remote_read.patch
Type: text/x-patch
Size: 795 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20120128/98062329/hastd.remote_read.bin
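
The second issue above can be pictured with a small stand-alone sketch. It is not hastd code: the enum, the struct and the function name are hypothetical stand-ins, and it assumes that the early completion in async mode is meant only for requests that arrived from ggate, not for synchronization requests generated by hastd itself. The patch quoted in the follow-up expresses the same idea with HAST_REPLICATION_ASYNC and ISSYNCREQ().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the definitions used by hastd. */
enum replication_mode { REPL_FULLSYNC, REPL_MEMSYNC, REPL_ASYNC };

struct request {
	enum replication_mode mode;	/* replication mode of the resource */
	bool is_sync_req;		/* true for a synchronization request */
};

/*
 * Decide whether completion may be reported right after the local write.
 * Assumption: only async mode completes early, and never for
 * synchronization requests, so those are not completed a second time.
 */
static bool
complete_after_local_write(const struct request *req)
{
	assert(req != NULL);
	return (req->mode == REPL_ASYNC && !req->is_sync_req);
}
```

With this check, an ordinary write in async mode is completed early, while a synchronization request in the same mode is left to the normal completion path.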
On Sun, Jan 29, 2012 at 12:35:35AM +0200, Mikolaj Golub wrote:
> Investigating, it looks after r226859, when 'async' mode was added, we have 2
> issues with synchronization from secondary to master (rather very rare case
> normally):
>
> 1) When the synchronization from secondary to master is running and primary
> gets READ request, the request should be sent to the secondary but actually it
> is lost. As a result READ operation gets stuck. After the synchronization is
> complete the following READ requests, which now can be served by primary, work
> ok.
>
> 2) In async mode, for synchronization requests, write_complete() function,
> which sends G_GATE_CMD_DONE command to ggate, is called twice and the second
> call fails.
>
> Artem, did you run async mode? If you did then I suppose you observed the
> second issue. Could you please try the attached patch?

The analysis and fixes look good to me, please go ahead and commit (small nits below).

> Index: sbin/hastd/primary.c
> ==================================================================
> --- sbin/hastd/primary.c (revision 230661)
> +++ sbin/hastd/primary.c (working copy)
> @@ -1255,7 +1255,7 @@ ggate_recv_thread(void *arg)
>  		pjdlog_debug(2,
>  		    "ggate_recv: (%p) Moving request to the send queues.", hio);
>  		refcount_init(&hio->hio_countdown, ncomps);
> -		for (ii = ncomp; ii < ncomps; ii++)
> +		for (ii = ncomp; ncomps != 0; ncomps--, ii++)

I'd prefer not to modify ncomps in the loop, maybe something like this:

	for (ii = ncomp; ii < ncomp + ncomps; ii++)

>  			QUEUE_INSERT1(hio, send, ii);
>  	}
>  	/* NOTREACHED */
> @@ -1326,7 +1326,7 @@ local_send_thread(void *arg)
>  			} else {
>  				hio->hio_errors[ncomp] = 0;
>  				if (hio->hio_replication ==
> -				    HAST_REPLICATION_ASYNC) {
> +				    HAST_REPLICATION_ASYNC && !ISSYNCREQ(hio)) {

Could you move this additional check to a separate line?

Thanks!

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!
http://tupytaj.pl
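
The loop-bound nit in the review can be checked with a small stand-alone sketch. It assumes, as the suggested loop implies, that ncomp is the index of the first component to queue the request on and ncomps is the number of components; the helper names are hypothetical and nothing here is taken from primary.c beyond the two loop headers. If that reading is right, the pre-patch loop never runs when a request targets only component 1, which would be consistent with the lost READ request described earlier in the thread, although the thread does not spell that out.

```c
#include <assert.h>

/* Iterations of the pre-patch loop: for (ii = ncomp; ii < ncomps; ii++). */
static unsigned int
visits_old(unsigned int ncomp, unsigned int ncomps)
{
	unsigned int ii, n = 0;

	for (ii = ncomp; ii < ncomps; ii++)
		n++;
	return (n);
}

/* Iterations of the loop suggested in the review; ncomps is not modified. */
static unsigned int
visits_suggested(unsigned int ncomp, unsigned int ncomps)
{
	unsigned int ii, n = 0;

	for (ii = ncomp; ii < ncomp + ncomps; ii++)
		n++;
	return (n);
}

/* Example: one component, starting at index 1 (hypothetical values). */
static void
example_single_component_at_index_one(void)
{
	assert(visits_old(1, 1) == 0);		/* nothing is queued */
	assert(visits_suggested(1, 1) == 1);	/* queued exactly once */
}
```

When the range starts at index 0 both forms visit the same components, so the difference only shows up when ncomp is non-zero.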