Well, I bit the bullet and moved to using hast - all went beautifully, and I migrated the pool with no downtime. The one thing I do notice, however, is that the synchronisation with hast is much slower than the older ggate+gmirror combination. It's about half the speed in fact.

When I originally set up my ggate configuration I did a lot of tweaks to get the speed good - these consisted of expanding the send and receive space for the sockets using sysctl.conf, and then providing large buffers to ggate. Is there a way to control this with hast? I still have the sysctls set (as the machines have not rebooted) but I can't see any options in hast.conf which are equivalent to the "-S 262144 -R 262144" which I use with ggate.

Any advice, or am I barking up the wrong tree here?

cheers,

-pete.
On Thu, 21 Oct 2010 13:25:34 +0100 Pete French wrote:

PF> Well, I bit the bullet and moved to using hast - all went beautifully,
PF> and I migrated the pool with no downtime. The one thing I do notice,
PF> however, is that the synchronisation with hast is much slower
PF> than the older ggate+gmirror combination. It's about half the
PF> speed in fact.
PF> When I orginaly setup my ggate configuration I did a lot of tweaks to
PF> get the speed good - these copnsisted of expanding the send and
PF> receive space for the sockets using sysctl.conf, and then providing
PF> large buffers to ggate. Is there a way to control this with hast ?
PF> I still have the sysctls set (as the machines have not rebooted)
PF> but I cant see any options in hast.conf which are equivalent to the
PF> "-S 262144 -R 262144" which I use with ggate
PF> Any advice, or am I barking up the wrong tree here ?

Currently there are no options in hast.conf to change the send and receive buffer sizes. They are hardcoded in sbin/hastd/proto_tcp4.c:

	val = 131072;
	if (setsockopt(tctx->tc_fd, SOL_SOCKET, SO_SNDBUF, &val,
	    sizeof(val)) == -1) {
		pjdlog_warning("Unable to set send buffer size on %s", addr);
	}
	val = 131072;
	if (setsockopt(tctx->tc_fd, SOL_SOCKET, SO_RCVBUF, &val,
	    sizeof(val)) == -1) {
		pjdlog_warning("Unable to set receive buffer size on %s", addr);
	}

You could change the values and recompile hastd :-). It would be interesting to know the results of your experiment (if you try it).

Also note there is another hardcoded value, in sbin/hastd/proto_common.c:

	/* Maximum size of packet we want to use when sending data. */
	#define MAX_SEND_SIZE	32768

which looks like it might affect synchronization speed too. Previously we had 128kB here, but it was changed to 32kB after slow synchronization with MAX_SEND_SIZE=128kB was reported:
http://svn.freebsd.org/viewvc/base?view=revision&revision=211452

I wonder whether the slow synchronization with MAX_SEND_SIZE=131072 could have been due to SO_SNDBUF/SO_RCVBUF being equal to this size? Maybe by increasing SO_SNDBUF/SO_RCVBUF we could get better performance with MAX_SEND_SIZE=128kB?

--
Mikolaj Golub
> You can check if the queue size is an issue by monitoring with netstat the Recv-Q and
> Send-Q values for the hastd connections during the test. Running something like below:
>
> while sleep 1; do netstat -na | grep '\.8457.*ESTAB'; done

Interesting - I ran those and started a complete resilver (I do this by changing the secondary to 'init', running 'create' and then changing the role back to secondary). On the primary I get...

tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0  29872 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0    115 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0  80928 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0  32883 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0    115 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10062       10.17.18.2.8457        ESTABLISHED
tcp4       0      0 10.17.18.1.10061       10.17.18.2.8457        ESTABLISHED

And on the secondary....
tcp4       0     27 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0     27 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0     27 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0     27 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4  105544      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4    8688      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4   84360      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4  102648      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0     27 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4   17376      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4   64088      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0     27 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0     27 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4   34216      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0     27 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED
tcp4       0      0 10.17.18.2.8457        10.17.18.1.10061       ESTABLISHED
tcp4       0     27 10.17.18.2.8457        10.17.18.1.10062       ESTABLISHED

That's just an example - I see the same kind of behaviour throughout the synchronisation process.
I can't compare it to gmirror+ggated, but it looks far more bursty than I would expect.

-pete.
Actually, I just looked in dmesg on the secondary - it is full of messages like this:

Oct 26 15:44:59 serpentine-passive hastd[10394]: [serp0] (secondary) Unable to receive request header: RPC version wrong.
Oct 26 15:45:00 serpentine-passive hastd[782]: [serp0] (secondary) Worker process exited ungracefully (pid=10394, exitcode=75).
Oct 26 15:46:59 serpentine-passive hastd[10421]: [serp0] (secondary) Unable to receive request header: RPC version wrong.
Oct 26 15:47:04 serpentine-passive hastd[782]: [serp0] (secondary) Worker process exited ungracefully (pid=10421, exitcode=75).

Does that help explain my issues? I have the same OS build running on both machines, so I don't see how I can have a version mismatch. The ethernet here consists of a pair of bge devices, which are bundled using LACP and lagg. I didn't see this on my test setup, but that was using ethernet directly - could there be a difference there?

-pete.
Just to report back on this - I just tried the patches from last week, which fixed the sending of the keepalives in a different thread, but my original issue (the synchronisation speed) remains, I'm afraid - so much for the theory that the corruption was causing the speed decrease. It's obviously good to have the threading issue fixed though.

-pete.