Adding information to my earlier post (quoted below):
I said that, before trying the HA translator, the client did not recover after
a failover and reported an endless string of stale NFS file handle errors. Here
are the relevant parts of the client log file from that scenario:
[... vol file ...]
### Add client feature and attach to remote subvolume
volume master_root
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.1.99 # this is the virtual IP
  option transport.socket.nodelay on
  option ping-timeout 5
  option remote-port 3399
  option remote-subvolume threads1
end-volume
[..... starts off ok ......]
[2010-05-07 09:13:55] N [glusterfsd.c:1408:main] glusterfs: Successfully started
[2010-05-07 09:13:55] N [client-protocol.c:6246:client_setvolume_cbk]
master_root: Connected to 192.168.1.99:3399, attached to remote volume
'threads1'.
[..... some lookup errors, probably related to this being a shared root cluster?
.....]
[2010-05-07 09:16:59] W [fuse-bridge.c:491:fuse_entry_cbk] glusterfs-fuse:
LOOKUP(/sbin/mkfs.vfat) inode (ptr=0x90d6538, ino=35520686,
gen=5467131851820775419) found conflict (ptr=0x90d62f8, ino=35520686,
gen=5467131851820775419)
[2010-05-07 09:17:00] W [fuse-bridge.c:491:fuse_entry_cbk] glusterfs-fuse:
LOOKUP(/sbin/tune2fs) inode (ptr=0x90da030, ino=35520696,
gen=5467131851820775464) found conflict (ptr=0x90d35a8, ino=35520696,
gen=5467131851820775464)
[2010-05-07 09:17:20] W [fuse-bridge.c:1848:fuse_readv_cbk] glusterfs-fuse:
200101: READ => -1 (Invalid cross-device link)
[2010-05-07 09:17:20] W [fuse-bridge.c:1848:fuse_readv_cbk] glusterfs-fuse:
200104: READ => -1 (Invalid cross-device link)
[..... Now I kill the server and the IP fails over to an identical box .....]
[2010-05-07 09:20:45] E [client-protocol.c:415:client_ping_timer_expired]
master_root: Server 192.168.1.99:3399 has not responded in the last
5 seconds, disconnecting.
[..... briefly we have transport endpoint error .....]
[2010-05-07 09:20:47] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
209367: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-05-07 09:20:47] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
209368: LOOKUP() / => -1 (Transport endpoint is not connected)
[..... then the client process reconnects to other server which now has the .99
IP ......]
[2010-05-07 09:20:48] N [client-protocol.c:6246:client_setvolume_cbk]
master_root: Connected to 192.168.1.99:3399, attached to remote volume
'threads1'.
[2010-05-07 09:20:48] N [client-protocol.c:6246:client_setvolume_cbk]
master_root: Connected to 192.168.1.99:3399, attached to remote volume
'threads1'.
[...... every attempt to use the mountpoint now produces a stale NFS file handle
error; in previous versions this did not happen .....]
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
209369: LOOKUP() / => -1 (Stale NFS file handle)
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
209370: LOOKUP() / => -1 (Stale NFS file handle)
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
209371: LOOKUP() / => -1 (Stale NFS file handle)
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
209372: LOOKUP() / => -1 (Stale NFS file handle)
[root@node1 ~]# ls /shared_root
ls: /shared_root: Stale NFS file handle
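To timestamp exactly when the mountpoint goes from "Transport endpoint is not connected" (ENOTCONN) to "Stale NFS file handle" (ESTALE) during a failover test, a small probe can be left running on the client. The following is only a sketch: the names probe_mount and watch are made up for illustration and are not part of GlusterFS, and a stat() on a hung FUSE mount may block rather than return an error.

```python
import errno
import os
import time


def probe_mount(path):
    """Return 'ok' if stat() succeeds, else the symbolic errno name."""
    try:
        os.stat(path)
    except OSError as exc:
        # errno.errorcode maps numeric codes to names such as
        # 'ESTALE', 'ENOTCONN' or 'EIO'.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


def watch(path, samples=5, interval=1.0):
    """Probe path repeatedly, printing a timestamped status each time."""
    results = []
    for _ in range(samples):
        status = probe_mount(path)
        print(time.strftime("%H:%M:%S"), status, flush=True)
        results.append(status)
        time.sleep(interval)
    return results
```

Calling watch("/shared_root", samples=60) just before killing the server would give one status line per second across the failover window, which can then be lined up against the client log timestamps.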
----- "Christopher Hawkins" <chawkins at bplinux.com> wrote:
> Hello, I have a problem now that was previously solved. In a simple
> setup with two servers and one client, the way I had things configured
> was that the client connected to a virtual IP that could fail back and
> forth to whatever server was available. This used to work. But I have
> not tested since 2.0.9 until today... And now instead of recovering
> after a brief timeout, the client never recovers and reports endless
> "Stale NFS file handle" errors in its log (though there is no NFS
> involved, just the native Gluster client).
>
> So I tried the HA translator from testing. This also does not work.
> After I kill the primary server (listed first in the config file), an
> ls of the mount point hangs for a moment and then reports:
>
> [root@server2 glusterfs]# ls /mnt/test
> ls: /mnt/test: Input/output error
>
> Each attempted ls produces two errors in the client log as well, a
> "Transport endpoint is not connected" error followed by the
> "Input/output error".
>
>
> The client log shows this:
>
> [2010-05-07 12:03:44] N [glusterfsd.c:1408:main] glusterfs:
> Successfully started
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk]
> master2_root: Connected to 192.168.1.92:3399, attached to remote
> volume 'threads1'.
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk]
> master_root: Connected to 192.168.1.91:3399, attached to remote volume
> 'threads1'.
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk]
> master2_root: Connected to 192.168.1.92:3399, attached to remote
> volume 'threads1'.
> [2010-05-07 12:03:44] N [fuse-bridge.c:2950:fuse_init] glusterfs-fuse:
> FUSE inited with protocol versions: glusterfs 7.13 kernel 7.10
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk]
> master_root: Connected to 192.168.1.91:3399, attached to remote volume
> 'threads1'.
>
>
> [....here I killed the primary server....]
>
> [2010-05-07 12:06:17] E
> [client-protocol.c:415:client_ping_timer_expired] master_root: Server
> 192.168.1.91:3399 has not responded in the last 42 seconds,
> disconnecting.
> [2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind]
> master_root: forced unwinding frame type(1) op(LOOKUP)
> [2010-05-07 12:06:17] E [ha.c:125:ha_lookup_cbk] ha:
> (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not
> connected)
> [2010-05-07 12:06:17] W [fuse-bridge.c:722:fuse_attr_cbk]
> glusterfs-fuse: 10: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind]
> master_root: forced unwinding frame type(2) op(PING)
> [2010-05-07 12:06:17] N [client-protocol.c:6994:notify] master_root:
> disconnected
> [2010-05-07 12:06:17] E [socket.c:762:socket_connect_finish]
> master_root: connection to 192.168.1.91:3399 failed (No route to
> host)
> [2010-05-07 12:06:21] E [socket.c:762:socket_connect_finish]
> master_root: connection to 192.168.1.91:3399 failed (No route to
> host)
> [2010-05-07 12:06:21] E [ha.c:125:ha_lookup_cbk] ha:
> (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not
> connected)
> [2010-05-07 12:06:21] W [fuse-bridge.c:722:fuse_attr_cbk]
> glusterfs-fuse: 11: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:24] E [ha.c:125:ha_lookup_cbk] ha:
> (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not
> connected)
> [2010-05-07 12:06:24] W [fuse-bridge.c:722:fuse_attr_cbk]
> glusterfs-fuse: 12: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:26] E [ha.c:125:ha_lookup_cbk] ha:
> (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not
> connected)
> [2010-05-07 12:06:26] W [fuse-bridge.c:722:fuse_attr_cbk]
> glusterfs-fuse: 13: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:39] E [ha.c:125:ha_lookup_cbk] ha:
> (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not
> connected)
> [2010-05-07 12:06:39] W [fuse-bridge.c:722:fuse_attr_cbk]
> glusterfs-fuse: 14: LOOKUP() / => -1 (Input/output error)
>
> [.... here I powered the primary server back on....]
>
> [2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk]
> master_root: Connected to 192.168.1.91:3399, attached to remote volume
> 'threads1'.
> [2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk]
> master_root: Connected to 192.168.1.91:3399, attached to remote volume
> 'threads1'.
> --------- end log ------------
>
> And after it came back, the client recovered and everything picked
> back up. But it seems I cannot get the client to consider any server
> other than the first one it connects to. I assume that if failing the
> primary server's IP address over to another box doesn't work, then round
> robin DNS will also not work since they are essentially the same
> method (a different server with the same address). And since this used
> to work, this seems to be an unintended result.
>
> The server vol file has a single export and io-threads, and the client
> has just the two remote-subvolumes and the ha declaration like so:
>
> volume ha
> type cluster/ha
> subvolumes master_root master2_root
> end-volume
>
> Code base is GlusterFS version 3.0.4, compiled from source this morning.
> How can I troubleshoot?
>
> Christopher Hawkins
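For comparing the two scenarios above, it may help to reduce each client log to the order in which the different errors first appear. The sketch below assumes the on-disk log has one entry per line (the excerpts above were wrapped by the mail client) and that failed FUSE operations are logged as "=> -1 (error text)", as in the excerpts. The function name summarize is hypothetical.

```python
import re
from collections import OrderedDict

# "[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] message..."
ENTRY = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<level>[A-Z]) \[(?P<src>[^\]]+)\] (?P<msg>.*)$"
)
# Failed operations end in "=> -1 (error text)".
FAILED_OP = re.compile(r"=> -1 \((?P<err>[^)]+)\)")


def summarize(lines):
    """Map each error text to (first timestamp, count), in order of appearance."""
    seen = OrderedDict()
    for line in lines:
        entry = ENTRY.match(line.strip())
        if not entry:
            continue  # not a log entry (blank line, vol file dump, ...)
        failed = FAILED_OP.search(entry.group("msg"))
        if not failed:
            continue  # entry is not a failed operation
        err = failed.group("err")
        first, count = seen.get(err, (entry.group("ts"), 0))
        seen[err] = (first, count + 1)
    return seen
```

Passing an open log file to summarize() and printing the items gives a short timeline, for example showing the "Transport endpoint is not connected" errors first and the "Stale NFS file handle" errors starting after the reconnect.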