Alessandro De Salvo
2015-Jun-10 19:07 UTC
[Gluster-users] Questions on ganesha HA and shared storage size
Hi, by looking at the connections I also see a strange problem: # netstat -ltaupn | grep 2049 tcp6 4 0 :::2049 :::* LISTEN 32080/ganesha.nfsd tcp6 1 0 x.x.x.2:2049 x.x.x.2:33285 CLOSE_WAIT - tcp6 1 0 127.0.0.1:2049 127.0.0.1:39555 CLOSE_WAIT - udp6 0 0 :::2049 :::* 32080/ganesha.nfsd Why is tcp6 used with an IPv4 address? On another machine, where ganesha 2.1.0 is running, I see that tcp is used, not tcp6. Could it be that the RPCs are always trying to use IPv6? That would be wrong. Thanks, Alessandro On Wed, 2015-06-10 at 15:28 +0530, Soumya Koduri wrote: > > On 06/10/2015 05:49 AM, Alessandro De Salvo wrote: > > Hi, > > I have enabled the full debug already, but I see nothing special. Before exporting any volume the log shows no error, even when I do a showmount (the log is attached, ganesha.log.gz). If I do the same after exporting a volume, nfs-ganesha does not even start, complaining that it cannot bind the IPv6 rquota socket, but in fact there is nothing listening on IPv6, so this should not happen: > > > > tcp6 0 0 :::111 :::* LISTEN 7433/rpcbind > > tcp6 0 0 :::2224 :::* LISTEN 9054/ruby > > tcp6 0 0 :::22 :::* LISTEN 1248/sshd > > udp6 0 0 :::111 :::* 7433/rpcbind > > udp6 0 0 fe80::8c2:27ff:fef2:123 :::* 31238/ntpd > > udp6 0 0 fe80::230:48ff:fed2:123 :::* 31238/ntpd > > udp6 0 0 fe80::230:48ff:fed2:123 :::* 31238/ntpd > > udp6 0 0 fe80::230:48ff:fed2:123 :::* 31238/ntpd > > udp6 0 0 ::1:123 :::* 31238/ntpd > > udp6 0 0 fe80::5484:7aff:fef:123 :::* 31238/ntpd > > udp6 0 0 :::123 :::* 31238/ntpd > > udp6 0 0 :::824 :::* 7433/rpcbind > > > > The error, as shown in the attached ganesha-after-export.log.gz logfile, is the following: > > > > > > 10/06/2015 02:07:47 : epoch 55777fb5 : node2 : ganesha.nfsd-26195[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use) > > 10/06/2015 02:07:47 : epoch 55777fb5 : node2 : ganesha.nfsd-26195[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue. 
> > 10/06/2015 02:07:48 : epoch 55777fb5 : node2 : ganesha.nfsd-26195[main] glusterfs_unload :FSAL :DEBUG :FSAL Gluster unloaded > > > > We have seen such issues with RPCBIND a few times. The NFS-Ganesha setup first > disables Gluster-NFS and then brings up the NFS-Ganesha service. Sometimes > there can be a delay or an issue with Gluster-NFS un-registering those > services, and when NFS-Ganesha tries to register on the same port, it > throws this error. Please try registering Rquota on a random port > using the config option below in "/etc/ganesha/ganesha.conf" > > NFS_Core_Param { > #Use a non-privileged port for RQuota > Rquota_Port = 4501; > } > > and clean up the '/var/cache/rpcbind/' directory before the setup. > > Thanks, > Soumya > > > > > Thanks, > > > > Alessandro > > > > > > > > > >> Il giorno 09/giu/2015, alle ore 18:37, Soumya Koduri <skoduri at redhat.com> ha scritto: > >> > >> > >> > >> On 06/09/2015 09:47 PM, Alessandro De Salvo wrote: > >>> Another update: the fact that I was unable to use vol set ganesha.enable > >>> was due to another bug in the ganesha scripts. In short, they are all > >>> using the following line to get the location of the conf file: > >>> > >>> CONF=$(cat /etc/sysconfig/ganesha | grep "CONFFILE" | cut -f 2 -d "=") > >>> > >>> First of all, by default there is no CONFFILE line in /etc/sysconfig/ganesha; second, there is a bug > >>> in that directive, as it works if I add > >>> in /etc/sysconfig/ganesha > >>> > >>> CONFFILE=/etc/ganesha/ganesha.conf > >>> > >>> but it fails if the same value is quoted > >>> > >>> CONFFILE="/etc/ganesha/ganesha.conf" > >>> > >>> It would be much better to use the following, which has a default as > >>> well: > >>> > >>> eval $(grep -F CONFFILE= /etc/sysconfig/ganesha) > >>> CONF=${CONFFILE:-/etc/ganesha/ganesha.conf} > >>> > >>> I'll update the bug report. > >>> Having said this... the last issue to tackle is the real problem with > >>> the ganesha.nfsd :-( > >> > >> Thanks. 
Could you try changing the log level to NIV_FULL_DEBUG in '/etc/sysconfig/ganesha' and check if anything gets logged in '/var/log/ganesha.log' or '/ganesha.log'. > >> > >> Thanks, > >> Soumya > >> > >>> Cheers, > >>> > >>> Alessandro > >>> > >>> > >>> On Tue, 2015-06-09 at 14:25 +0200, Alessandro De Salvo wrote: > >>>> OK, I can confirm that the ganesha.nfsd process is actually not > >>>> answering the calls. Here is what I see: > >>>> > >>>> # rpcinfo -p > >>>> program vers proto port service > >>>> 100000 4 tcp 111 portmapper > >>>> 100000 3 tcp 111 portmapper > >>>> 100000 2 tcp 111 portmapper > >>>> 100000 4 udp 111 portmapper > >>>> 100000 3 udp 111 portmapper > >>>> 100000 2 udp 111 portmapper > >>>> 100024 1 udp 41594 status > >>>> 100024 1 tcp 53631 status > >>>> 100003 3 udp 2049 nfs > >>>> 100003 3 tcp 2049 nfs > >>>> 100003 4 udp 2049 nfs > >>>> 100003 4 tcp 2049 nfs > >>>> 100005 1 udp 58127 mountd > >>>> 100005 1 tcp 56301 mountd > >>>> 100005 3 udp 58127 mountd > >>>> 100005 3 tcp 56301 mountd > >>>> 100021 4 udp 46203 nlockmgr > >>>> 100021 4 tcp 41798 nlockmgr > >>>> 100011 1 udp 875 rquotad > >>>> 100011 1 tcp 875 rquotad > >>>> 100011 2 udp 875 rquotad > >>>> 100011 2 tcp 875 rquotad > >>>> > >>>> # netstat -lpn | grep ganesha > >>>> tcp6 14 0 :::2049 :::* > >>>> LISTEN 11937/ganesha.nfsd > >>>> tcp6 0 0 :::41798 :::* > >>>> LISTEN 11937/ganesha.nfsd > >>>> tcp6 0 0 :::875 :::* > >>>> LISTEN 11937/ganesha.nfsd > >>>> tcp6 10 0 :::56301 :::* > >>>> LISTEN 11937/ganesha.nfsd > >>>> tcp6 0 0 :::564 :::* > >>>> LISTEN 11937/ganesha.nfsd > >>>> udp6 0 0 :::2049 :::* > >>>> 11937/ganesha.nfsd > >>>> udp6 0 0 :::46203 :::* > >>>> 11937/ganesha.nfsd > >>>> udp6 0 0 :::58127 :::* > >>>> 11937/ganesha.nfsd > >>>> udp6 0 0 :::875 :::* > >>>> 11937/ganesha.nfsd > >>>> > >>>> I'm attaching the strace of a showmount from one node to the other. > >>>> This machinery was working with nfs-ganesha 2.1.0, so it must be > >>>> something introduced with 2.2.0. 
> >>>> Cheers, > >>>> > >>>> Alessandro > >>>> > >>>> > >>>> > >>>> On Tue, 2015-06-09 at 15:16 +0530, Soumya Koduri wrote: > >>>>> > >>>>> On 06/09/2015 02:48 PM, Alessandro De Salvo wrote: > >>>>>> Hi, > >>>>>> OK, the problem with the VIPs not starting is due to the ganesha_mon > >>>>>> heartbeat script looking for a pid file called > >>>>>> /var/run/ganesha.nfsd.pid, while by default ganesha.nfsd v.2.2.0 is > >>>>>> creating /var/run/ganesha.pid, this needs to be corrected. The file is > >>>>>> in glusterfs-ganesha-3.7.1-1.el7.x86_64, in my case. > >>>>>> For the moment I have created a symlink in this way and it works: > >>>>>> > >>>>>> ln -s /var/run/ganesha.pid /var/run/ganesha.nfsd.pid > >>>>>> > >>>>> Thanks. Please update this as well in the bug. > >>>>> > >>>>>> So far so good, the VIPs are up and pingable, but still there is the > >>>>>> problem of the hanging showmount (i.e. hanging RPC). > >>>>>> Still, I see a lot of errors like this in /var/log/messages: > >>>>>> > >>>>>> Jun 9 11:15:20 atlas-node1 lrmd[31221]: notice: operation_finished: > >>>>>> nfs-mon_monitor_10000:29292:stderr [ Error: Resource does not exist. ] > >>>>>> > >>>>>> While ganesha.log shows the server is not in grace: > >>>>>> > >>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29964[main] main :MAIN :EVENT :ganesha.nfsd Starting: > >>>>>> Ganesha Version /builddir/build/BUILD/nfs-ganesha-2.2.0/src, built at > >>>>>> May 18 2015 14:17:18 on buildhw-09.phx2.fedoraproject.org > >>>>>> <http://buildhw-09.phx2.fedoraproject.org> > >>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_set_param_from_conf :NFS STARTUP :EVENT > >>>>>> :Configuration file successfully parsed > >>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT > >>>>>> :Initializing ID Mapper. 
> >>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper > >>>>>> successfully initialized. > >>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] main :NFS STARTUP :WARN :No export entries > >>>>>> found in configuration file !!! > >>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] config_errs_to_log :CONFIG :WARN :Config File > >>>>>> ((null):0): Empty configuration file > >>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT > >>>>>> :CAP_SYS_RESOURCE was successfully removed for proper quota management > >>>>>> in FSAL > >>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT :currenty set > >>>>>> capabilities are: > >>>>>> cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep > >>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_Init_svc :DISP :CRIT :Cannot acquire > >>>>>> credentials for principal nfs > >>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin > >>>>>> thread initialized > >>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs4_start_grace :STATE :EVENT :NFS Server Now > >>>>>> IN GRACE, duration 60 > >>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > >>>>>> 
ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT > >>>>>> :Callback creds directory (/var/run/ganesha) already exists > >>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN > >>>>>> :gssd_refresh_krb5_machine_credential failed (2:2) > >>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :Starting > >>>>>> delayed executor. > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :9P/TCP > >>>>>> dispatcher thread was started successfully > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[_9p_disp] _9p_dispatcher_thread :9P DISP :EVENT :9P > >>>>>> dispatcher started > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT > >>>>>> :gsh_dbusthread was started successfully > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :admin thread > >>>>>> was started successfully > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :reaper thread > >>>>>> was started successfully > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN > >>>>>> GRACE > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :General > >>>>>> fridge was started successfully > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT > >>>>>> :------------------------------------------------- > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> 
ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT : NFS > >>>>>> SERVER INITIALIZED > >>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT > >>>>>> :------------------------------------------------- > >>>>>> 09/06/2015 11:17:22 : epoch 5576aee4 : atlas-node1 : > >>>>>> ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now > >>>>>> NOT IN GRACE > >>>>>> > >>>>>> > >>>>> Please check the status of nfs-ganesha > >>>>> $service nfs-ganesha status > >>>>> > >>>>> Could you try taking a packet trace (during showmount or mount) and > >>>>> check the server responses. > >>>>> > >>>>> Thanks, > >>>>> Soumya > >>>>> > >>>>>> Cheers, > >>>>>> > >>>>>> Alessandro > >>>>>> > >>>>>> > >>>>>>> Il giorno 09/giu/2015, alle ore 10:36, Alessandro De Salvo > >>>>>>> <alessandro.desalvo at roma1.infn.it > >>>>>>> <mailto:alessandro.desalvo at roma1.infn.it>> ha scritto: > >>>>>>> > >>>>>>> Hi Soumya, > >>>>>>> > >>>>>>>> Il giorno 09/giu/2015, alle ore 08:06, Soumya Koduri > >>>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On 06/09/2015 01:31 AM, Alessandro De Salvo wrote: > >>>>>>>>> OK, I found at least one of the bugs. > >>>>>>>>> The /usr/libexec/ganesha/ganesha.sh has the following lines: > >>>>>>>>> > >>>>>>>>> if [ -e /etc/os-release ]; then > >>>>>>>>> RHEL6_PCS_CNAME_OPTION="" > >>>>>>>>> fi > >>>>>>>>> > >>>>>>>>> This is OK for RHEL < 7, but does not work for >= 7. I have changed > >>>>>>>>> it to the following, to make it working: > >>>>>>>>> > >>>>>>>>> if [ -e /etc/os-release ]; then > >>>>>>>>> eval $(grep -F "REDHAT_SUPPORT_PRODUCT=" /etc/os-release) > >>>>>>>>> [ "$REDHAT_SUPPORT_PRODUCT" == "Fedora" ] && > >>>>>>>>> RHEL6_PCS_CNAME_OPTION="" > >>>>>>>>> fi > >>>>>>>>> > >>>>>>>> Oh..Thanks for the fix. Could you please file a bug for the same (and > >>>>>>>> probably submit your fix as well). 
We shall have it corrected. > >>>>>>> > >>>>>>> Just did it, https://bugzilla.redhat.com/show_bug.cgi?id=1229601 > >>>>>>> > >>>>>>>> > >>>>>>>>> Apart from that, the VIP_<node> entries I was using were wrong, and I should > >>>>>>>>> have converted all the '-' to underscores; maybe this could be > >>>>>>>>> mentioned in the documentation when you have it ready. > >>>>>>>>> Now the cluster starts, but apparently the VIPs do not: > >>>>>>>>> > >>>>>>>> Sure. Thanks again for pointing it out. We shall make a note of it. > >>>>>>>> > >>>>>>>>> Online: [ atlas-node1 atlas-node2 ] > >>>>>>>>> > >>>>>>>>> Full list of resources: > >>>>>>>>> > >>>>>>>>> Clone Set: nfs-mon-clone [nfs-mon] > >>>>>>>>> Started: [ atlas-node1 atlas-node2 ] > >>>>>>>>> Clone Set: nfs-grace-clone [nfs-grace] > >>>>>>>>> Started: [ atlas-node1 atlas-node2 ] > >>>>>>>>> atlas-node1-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped > >>>>>>>>> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 > >>>>>>>>> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped > >>>>>>>>> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 > >>>>>>>>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 > >>>>>>>>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 > >>>>>>>>> > >>>>>>>>> PCSD Status: > >>>>>>>>> atlas-node1: Online > >>>>>>>>> atlas-node2: Online > >>>>>>>>> > >>>>>>>>> Daemon Status: > >>>>>>>>> corosync: active/disabled > >>>>>>>>> pacemaker: active/disabled > >>>>>>>>> pcsd: active/enabled > >>>>>>>>> > >>>>>>>>> > >>>>>>>> Here corosync and pacemaker show the 'disabled' state. Can you check the > >>>>>>>> status of their services? They should be running prior to cluster > >>>>>>>> creation. We need to include that step in the document as well. 
> >>>>>>> > >>>>>>> Ah, OK, you're right, I have added it to my puppet modules (we install > >>>>>>> and configure ganesha via puppet, I'll put the module on puppetforge > >>>>>>> soon, in case anyone is interested). > >>>>>>> > >>>>>>>> > >>>>>>>>> But the issue that is puzzling me most is the following: > >>>>>>>>> > >>>>>>>>> # showmount -e localhost > >>>>>>>>> rpc mount export: RPC: Timed out > >>>>>>>>> > >>>>>>>>> And when I try to enable the ganesha exports on a volume I get this > >>>>>>>>> error: > >>>>>>>>> > >>>>>>>>> # gluster volume set atlas-home-01 ganesha.enable on > >>>>>>>>> volume set: failed: Failed to create NFS-Ganesha export config file. > >>>>>>>>> > >>>>>>>>> But I see the file created in /etc/ganesha/exports/*.conf > >>>>>>>>> Still, showmount hangs and times out. > >>>>>>>>> Any help? > >>>>>>>>> Thanks, > >>>>>>>>> > >>>>>>>> Hmm, that's strange. Sometimes, when there was no proper cleanup > >>>>>>>> done while trying to re-create the cluster, we have seen such issues. > >>>>>>>> > >>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1227709 > >>>>>>>> > >>>>>>>> http://review.gluster.org/#/c/11093/ > >>>>>>>> > >>>>>>>> Can you please unexport all the volumes, tear down the cluster using > >>>>>>>> 'gluster vol set <volname> ganesha.enable off' > >>>>>>> > >>>>>>> OK: > >>>>>>> > >>>>>>> # gluster vol set atlas-home-01 ganesha.enable off > >>>>>>> volume set: failed: ganesha.enable is already 'off'. > >>>>>>> > >>>>>>> # gluster vol set atlas-data-01 ganesha.enable off > >>>>>>> volume set: failed: ganesha.enable is already 'off'. > >>>>>>> > >>>>>>> > >>>>>>>> 'gluster ganesha disable' command. > >>>>>>> > >>>>>>> I'm assuming you wanted to write nfs-ganesha instead? > >>>>>>> > >>>>>>> # gluster nfs-ganesha disable > >>>>>>> ganesha enable : success > >>>>>>> > >>>>>>> > >>>>>>> A side note (not really important): it's strange that when I do a > >>>>>>> disable the message is 'ganesha enable' 
:-) > >>>>>>> > >>>>>>>> > >>>>>>>> Verify if the following files have been deleted on all the nodes - > >>>>>>>> '/etc/cluster/cluster.conf' > >>>>>>> > >>>>>>> this file is not present at all, I think it's not needed in CentOS 7 > >>>>>>> > >>>>>>>> '/etc/ganesha/ganesha.conf', > >>>>>>> > >>>>>>> it's still there, but empty, and I guess it should be OK, right? > >>>>>>> > >>>>>>>> '/etc/ganesha/exports/*' > >>>>>>> > >>>>>>> no more files there > >>>>>>> > >>>>>>>> '/var/lib/pacemaker/cib' > >>>>>>> > >>>>>>> it's empty > >>>>>>> > >>>>>>>> > >>>>>>>> Verify if the ganesha service is stopped on all the nodes. > >>>>>>> > >>>>>>> nope, it's still running, I will stop it. > >>>>>>> > >>>>>>>> > >>>>>>>> start/restart the services - corosync, pcs. > >>>>>>> > >>>>>>> On the node where I issued the nfs-ganesha disable there is no longer > >>>>>>> any /etc/corosync/corosync.conf, so corosync won't start. The other > >>>>>>> node instead still has the file, which is strange. > >>>>>>> > >>>>>>>> > >>>>>>>> And re-try the HA cluster creation > >>>>>>>> 'gluster ganesha enable' 
> >>>>>>> This time (repeated twice) it did not work at all: > >>>>>>> > >>>>>>> # pcs status > >>>>>>> Cluster name: ATLAS_GANESHA_01 > >>>>>>> Last updated: Tue Jun 9 10:13:43 2015 > >>>>>>> Last change: Tue Jun 9 10:13:22 2015 > >>>>>>> Stack: corosync > >>>>>>> Current DC: atlas-node1 (1) - partition with quorum > >>>>>>> Version: 1.1.12-a14efad > >>>>>>> 2 Nodes configured > >>>>>>> 6 Resources configured > >>>>>>> > >>>>>>> > >>>>>>> Online: [ atlas-node1 atlas-node2 ] > >>>>>>> > >>>>>>> Full list of resources: > >>>>>>> > >>>>>>> Clone Set: nfs-mon-clone [nfs-mon] > >>>>>>> Started: [ atlas-node1 atlas-node2 ] > >>>>>>> Clone Set: nfs-grace-clone [nfs-grace] > >>>>>>> Started: [ atlas-node1 atlas-node2 ] > >>>>>>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 > >>>>>>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 > >>>>>>> > >>>>>>> PCSD Status: > >>>>>>> atlas-node1: Online > >>>>>>> atlas-node2: Online > >>>>>>> > >>>>>>> Daemon Status: > >>>>>>> corosync: active/enabled > >>>>>>> pacemaker: active/enabled > >>>>>>> pcsd: active/enabled > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> I then tried "pcs cluster destroy" on both nodes, and then again > >>>>>>> nfs-ganesha enable, but now I'm back to the old problem: > >>>>>>> > >>>>>>> # pcs status > >>>>>>> Cluster name: ATLAS_GANESHA_01 > >>>>>>> Last updated: Tue Jun 9 10:22:27 2015 > >>>>>>> Last change: Tue Jun 9 10:17:00 2015 > >>>>>>> Stack: corosync > >>>>>>> Current DC: atlas-node2 (2) - partition with quorum > >>>>>>> Version: 1.1.12-a14efad > >>>>>>> 2 Nodes configured > >>>>>>> 10 Resources configured > >>>>>>> > >>>>>>> > >>>>>>> Online: [ atlas-node1 atlas-node2 ] > >>>>>>> > >>>>>>> Full list of resources: > >>>>>>> > >>>>>>> Clone Set: nfs-mon-clone [nfs-mon] > >>>>>>> Started: [ atlas-node1 atlas-node2 ] > >>>>>>> Clone Set: nfs-grace-clone [nfs-grace] > >>>>>>> Started: [ atlas-node1 atlas-node2 ] > >>>>>>> atlas-node1-cluster_ip-1 
(ocf::heartbeat:IPaddr): Stopped > >>>>>>> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 > >>>>>>> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped > >>>>>>> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 > >>>>>>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 > >>>>>>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 > >>>>>>> > >>>>>>> PCSD Status: > >>>>>>> atlas-node1: Online > >>>>>>> atlas-node2: Online > >>>>>>> > >>>>>>> Daemon Status: > >>>>>>> corosync: active/enabled > >>>>>>> pacemaker: active/enabled > >>>>>>> pcsd: active/enabled > >>>>>>> > >>>>>>> > >>>>>>> Cheers, > >>>>>>> > >>>>>>> Alessandro > >>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Soumya > >>>>>>>> > >>>>>>>>> Alessandro > >>>>>>>>> > >>>>>>>>>> Il giorno 08/giu/2015, alle ore 20:00, Alessandro De Salvo > >>>>>>>>>> <Alessandro.DeSalvo at roma1.infn.it > >>>>>>>>>> <mailto:Alessandro.DeSalvo at roma1.infn.it>> ha scritto: > >>>>>>>>>> > >>>>>>>>>> Hi, > >>>>>>>>>> indeed, it does not work :-) > >>>>>>>>>> OK, this is what I did, with 2 machines, running CentOS 7.1, > >>>>>>>>>> Glusterfs 3.7.1 and nfs-ganesha 2.2.0: > >>>>>>>>>> > >>>>>>>>>> 1) ensured that the machines are able to resolve their IPs (but > >>>>>>>>>> this was already true since they were in the DNS); > >>>>>>>>>> 2) disabled NetworkManager and enabled network on both machines; > >>>>>>>>>> 3) created a gluster shared volume 'gluster_shared_storage' and > >>>>>>>>>> mounted it on '/run/gluster/shared_storage' on all the cluster > >>>>>>>>>> nodes using glusterfs native mount (on CentOS 7.1 there is a link > >>>>>>>>>> by default /var/run -> ../run) > >>>>>>>>>> 4) created an empty /etc/ganesha/ganesha.conf; > >>>>>>>>>> 5) installed pacemaker pcs resource-agents corosync on all cluster > >>>>>>>>>> machines; > >>>>>>>>>> 6) set the 'hacluster' 
user the same password on all machines; > >>>>>>>>>> 7) pcs cluster auth <hostname> -u hacluster -p <pass> on all the > >>>>>>>>>> nodes (on both nodes I issued the commands for both nodes) > >>>>>>>>>> 8) IPv6 is configured by default on all nodes, although the > >>>>>>>>>> infrastructure is not ready for IPv6 > >>>>>>>>>> 9) enabled pcsd and started it on all nodes > >>>>>>>>>> 10) populated /etc/ganesha/ganesha-ha.conf with the following > >>>>>>>>>> contents, one per machine: > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> ===> atlas-node1 > >>>>>>>>>> # Name of the HA cluster created. > >>>>>>>>>> HA_NAME="ATLAS_GANESHA_01" > >>>>>>>>>> # The server from which you intend to mount > >>>>>>>>>> # the shared volume. > >>>>>>>>>> HA_VOL_SERVER="atlas-node1" > >>>>>>>>>> # The subset of nodes of the Gluster Trusted Pool > >>>>>>>>>> # that forms the ganesha HA cluster. IP/Hostname > >>>>>>>>>> # is specified. > >>>>>>>>>> HA_CLUSTER_NODES="atlas-node1,atlas-node2" > >>>>>>>>>> # Virtual IPs of each of the nodes specified above. > >>>>>>>>>> VIP_atlas-node1="x.x.x.1" > >>>>>>>>>> VIP_atlas-node2="x.x.x.2" > >>>>>>>>>> > >>>>>>>>>> ===> atlas-node2 > >>>>>>>>>> # Name of the HA cluster created. > >>>>>>>>>> HA_NAME="ATLAS_GANESHA_01" > >>>>>>>>>> # The server from which you intend to mount > >>>>>>>>>> # the shared volume. > >>>>>>>>>> HA_VOL_SERVER="atlas-node2" > >>>>>>>>>> # The subset of nodes of the Gluster Trusted Pool > >>>>>>>>>> # that forms the ganesha HA cluster. IP/Hostname > >>>>>>>>>> # is specified. > >>>>>>>>>> HA_CLUSTER_NODES="atlas-node1,atlas-node2" > >>>>>>>>>> # Virtual IPs of each of the nodes specified above. > >>>>>>>>>> VIP_atlas-node1="x.x.x.1" > >>>>>>>>>> VIP_atlas-node2="x.x.x.2" 
> >>>>>>>>>> > >>>>>>>>>> 11) issued gluster nfs-ganesha enable, but it fails with a cryptic > >>>>>>>>>> message: > >>>>>>>>>> > >>>>>>>>>> # gluster nfs-ganesha enable > >>>>>>>>>> Enabling NFS-Ganesha requires Gluster-NFS to be disabled across the > >>>>>>>>>> trusted pool. Do you still want to continue? (y/n) y > >>>>>>>>>> nfs-ganesha: failed: Failed to set up HA config for NFS-Ganesha. > >>>>>>>>>> Please check the log file for details > >>>>>>>>>> > >>>>>>>>>> Looking at the logs I found nothing really special but this: > >>>>>>>>>> > >>>>>>>>>> ==> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log <== > >>>>>>>>>> [2015-06-08 17:57:15.672844] I [MSGID: 106132] > >>>>>>>>>> [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs > >>>>>>>>>> already stopped > >>>>>>>>>> [2015-06-08 17:57:15.675395] I > >>>>>>>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host > >>>>>>>>>> found Hostname is atlas-node2 > >>>>>>>>>> [2015-06-08 17:57:15.720692] I > >>>>>>>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host > >>>>>>>>>> found Hostname is atlas-node2 > >>>>>>>>>> [2015-06-08 17:57:15.721161] I > >>>>>>>>>> [glusterd-ganesha.c:335:is_ganesha_host] 0-management: ganesha host > >>>>>>>>>> found Hostname is atlas-node2 > >>>>>>>>>> [2015-06-08 17:57:16.633048] E > >>>>>>>>>> [glusterd-ganesha.c:254:glusterd_op_set_ganesha] 0-management: > >>>>>>>>>> Initial NFS-Ganesha set up failed > >>>>>>>>>> [2015-06-08 17:57:16.641563] E > >>>>>>>>>> [glusterd-syncop.c:1396:gd_commit_op_phase] 0-management: Commit of > >>>>>>>>>> operation 'Volume (null)' failed on localhost : Failed to set up HA > >>>>>>>>>> config for NFS-Ganesha. Please check the log file for details > >>>>>>>>>> > >>>>>>>>>> ==> /var/log/glusterfs/cmd_history.log <== > >>>>>>>>>> [2015-06-08 17:57:16.643615] : nfs-ganesha enable : FAILED : > >>>>>>>>>> Failed to set up HA config for NFS-Ganesha. 
Please check the log > >>>>>>>>>> file for details > >>>>>>>>>> > >>>>>>>>>> ==> /var/log/glusterfs/cli.log <== > >>>>>>>>>> [2015-06-08 17:57:16.643839] I [input.c:36:cli_batch] 0-: Exiting > >>>>>>>>>> with: -1 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Also, pcs seems to be fine for the auth part, although it obviously > >>>>>>>>>> tells me the cluster is not running. > >>>>>>>>>> > >>>>>>>>>> I, [2015-06-08T19:57:16.305323 #7223] INFO -- : Running: > >>>>>>>>>> /usr/sbin/corosync-cmapctl totem.cluster_name > >>>>>>>>>> I, [2015-06-08T19:57:16.345457 #7223] INFO -- : Running: > >>>>>>>>>> /usr/sbin/pcs cluster token-nodes > >>>>>>>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET > >>>>>>>>>> /remote/check_auth HTTP/1.1" 200 68 0.1919 > >>>>>>>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET > >>>>>>>>>> /remote/check_auth HTTP/1.1" 200 68 0.1920 > >>>>>>>>>> atlas-node1.mydomain - - [08/Jun/2015:19:57:16 CEST] "GET > >>>>>>>>>> /remote/check_auth HTTP/1.1" 200 68 > >>>>>>>>>> - -> /remote/check_auth > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> What am I doing wrong? > >>>>>>>>>> Thanks, > >>>>>>>>>> > >>>>>>>>>> Alessandro > >>>>>>>>>> > >>>>>>>>>>> Il giorno 08/giu/2015, alle ore 19:30, Soumya Koduri > >>>>>>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On 06/08/2015 08:20 PM, Alessandro De Salvo wrote: > >>>>>>>>>>>> Sorry, just another question: > >>>>>>>>>>>> > >>>>>>>>>>>> - in my installation of gluster 3.7.1 the command gluster > >>>>>>>>>>>> features.ganesha enable does not work: > >>>>>>>>>>>> > >>>>>>>>>>>> # gluster features.ganesha enable > >>>>>>>>>>>> unrecognized word: features.ganesha (position 0) > >>>>>>>>>>>> > >>>>>>>>>>>> Which version has full support for it? > >>>>>>>>>>> > >>>>>>>>>>> Sorry. This option has recently been changed. 
It is now > >>>>>>>>>>> > >>>>>>>>>>> $ gluster nfs-ganesha enable > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> - in the documentation the ccs and cman packages are required, > >>>>>>>>>>>> but they seems not to be available anymore on CentOS 7 and > >>>>>>>>>>>> similar, I guess they are not really required anymore, as pcs > >>>>>>>>>>>> should do the full job > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> > >>>>>>>>>>>> Alessandro > >>>>>>>>>>> > >>>>>>>>>>> Looks like so from http://clusterlabs.org/quickstart-redhat.html. > >>>>>>>>>>> Let us know if it doesn't work. > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> Soumya > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 15:09, Alessandro De Salvo > >>>>>>>>>>>>> <alessandro.desalvo at roma1.infn.it > >>>>>>>>>>>>> <mailto:alessandro.desalvo at roma1.infn.it>> ha scritto: > >>>>>>>>>>>>> > >>>>>>>>>>>>> Great, many thanks Soumya! > >>>>>>>>>>>>> Cheers, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Alessandro > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 13:53, Soumya Koduri > >>>>>>>>>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Please find the slides of the demo video at [1] > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> We recommend to have a distributed replica volume as a shared > >>>>>>>>>>>>>> volume for better data-availability. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Size of the volume depends on the workload you may have. Since > >>>>>>>>>>>>>> it is used to maintain states of NLM/NFSv4 clients, you may > >>>>>>>>>>>>>> calculate the size of the volume to be minimum of aggregate of > >>>>>>>>>>>>>> (typical_size_of'/var/lib/nfs'_directory + > >>>>>>>>>>>>>> ~4k*no_of_clients_connected_to_each_of_the_nfs_servers_at_any_point) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> We shall document about this feature sooner in the gluster docs > >>>>>>>>>>>>>> as well. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> Soumya > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> [1] - http://www.slideshare.net/SoumyaKoduri/high-49117846 > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 06/08/2015 04:34 PM, Alessandro De Salvo wrote: > >>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>> I have seen the demo video on ganesha HA, > >>>>>>>>>>>>>>> https://www.youtube.com/watch?v=Z4mvTQC-efM > >>>>>>>>>>>>>>> However there is no advice on the appropriate size of the > >>>>>>>>>>>>>>> shared volume. How is it really used, and what should be a > >>>>>>>>>>>>>>> reasonable size for it? > >>>>>>>>>>>>>>> Also, are the slides from the video available somewhere, as > >>>>>>>>>>>>>>> well as a documentation on all this? I did not manage to find > >>>>>>>>>>>>>>> them. > >>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Alessandro > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>>>> Gluster-users mailing list > >>>>>>>>>>>>>>> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org> > >>>>>>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Gluster-users mailing list > >>>>>>>>>> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org> > >>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users > >>>>>>>>> > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Gluster-users mailing list > >>>>>>> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org> > >>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users > >>>>>> > >>>> > >>>> _______________________________________________ > >>>> Gluster-users mailing list > >>>> Gluster-users at gluster.org > >>>> http://www.gluster.org/mailman/listinfo/gluster-users > >>> > >>> > >
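[Editor's note: Soumya's sizing heuristic quoted in the thread above (size of '/var/lib/nfs' plus ~4k per client per NFS server) can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only; all the numbers in it are assumed values, not measurements from this thread.]

```shell
# Rough sizing sketch for the ganesha shared volume, following the heuristic
# quoted above: per server, size of /var/lib/nfs plus ~4 KiB per connected
# client. Every figure here is an assumption chosen for illustration.
var_lib_nfs_kib=1024        # assume ~1 MiB of NLM/NFSv4 state per node
clients_per_server=500      # assumed peak client count per NFS-Ganesha head
servers=2                   # nodes in the HA cluster
total_kib=$(( servers * (var_lib_nfs_kib + 4 * clients_per_server) ))
echo "estimated minimum shared volume size: ${total_kib} KiB"
```

The result is a lower bound; in practice one would round up generously, since the shared volume is tiny compared to any data volume.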
Alessandro De Salvo
2015-Jun-11 15:48 UTC
[Gluster-users] Questions on ganesha HA and shared storage size
Soumya, do you have any other idea of what to check on my side? Many thanks, Alessandro> Il giorno 10/giu/2015, alle ore 21:07, Alessandro De Salvo <alessandro.desalvo at roma1.infn.it> ha scritto: > > Hi, > by looking at the connections I also see a strange problem: > > # netstat -ltaupn | grep 2049 > tcp6 4 0 :::2049 :::* > LISTEN 32080/ganesha.nfsd > tcp6 1 0 x.x.x.2:2049 x.x.x.2:33285 CLOSE_WAIT > - > tcp6 1 0 127.0.0.1:2049 127.0.0.1:39555 > CLOSE_WAIT - > udp6 0 0 :::2049 :::* > 32080/ganesha.nfsd > > > Why tcp6 is used with an IPv4 address? > In another machine where ganesha 2.1.0 is running I see tcp is used, not > tcp6. > Could it be that the RPC are always trying to use IPv6? That would be > wrong. > Thanks, > > Alessandro > > On Wed, 2015-06-10 at 15:28 +0530, Soumya Koduri wrote: >> >> On 06/10/2015 05:49 AM, Alessandro De Salvo wrote: >>> Hi, >>> I have enabled the full debug already, but I see nothing special. Before exporting any volume the log shows no error, even when I do a showmount (the log is attached, ganesha.log.gz). 
If I do the same after exporting a volume nfs-ganesha does not even start, complaining for not being able to bind the IPv6 ruota socket, but in fact there is nothing listening on IPv6, so it should not happen: >>> >>> tcp6 0 0 :::111 :::* LISTEN 7433/rpcbind >>> tcp6 0 0 :::2224 :::* LISTEN 9054/ruby >>> tcp6 0 0 :::22 :::* LISTEN 1248/sshd >>> udp6 0 0 :::111 :::* 7433/rpcbind >>> udp6 0 0 fe80::8c2:27ff:fef2:123 :::* 31238/ntpd >>> udp6 0 0 fe80::230:48ff:fed2:123 :::* 31238/ntpd >>> udp6 0 0 fe80::230:48ff:fed2:123 :::* 31238/ntpd >>> udp6 0 0 fe80::230:48ff:fed2:123 :::* 31238/ntpd >>> udp6 0 0 ::1:123 :::* 31238/ntpd >>> udp6 0 0 fe80::5484:7aff:fef:123 :::* 31238/ntpd >>> udp6 0 0 :::123 :::* 31238/ntpd >>> udp6 0 0 :::824 :::* 7433/rpcbind >>> >>> The error, as shown in the attached ganesha-after-export.log.gz logfile, is the following: >>> >>> >>> 10/06/2015 02:07:47 : epoch 55777fb5 : node2 : ganesha.nfsd-26195[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use) >>> 10/06/2015 02:07:47 : epoch 55777fb5 : node2 : ganesha.nfsd-26195[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue. >>> 10/06/2015 02:07:48 : epoch 55777fb5 : node2 : ganesha.nfsd-26195[main] glusterfs_unload :FSAL :DEBUG :FSAL Gluster unloaded >>> >> >> We have seen such issues with RPCBIND few times. NFS-Ganesha setup first >> disables Gluster-NFS and then brings up NFS-Ganesha service. Sometimes, >> there could be delay or issue with Gluster-NFS un-registering those >> services and when NFS-Ganesha tries to register to the same port, it >> throws this error. Please try registering Rquota to any random port >> using below config option in "/etc/ganesha/ganesha.conf" >> >> NFS_Core_Param { >> #Use a non-privileged port for RQuota >> Rquota_Port = 4501; >> } >> >> and cleanup '/var/cache/rpcbind/' directory before the setup. 
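The workaround Soumya describes can be sketched as follows. A scratch file stands in for /etc/ganesha/ganesha.conf so the sketch is runnable anywhere; on a real node you would append to the actual config and then run the privileged commands shown in the comments as root:

```shell
# Register RQuota on a non-privileged port so a stale rpcbind
# registration on 875 cannot block ganesha startup.
conf=$(mktemp)   # stands in for /etc/ganesha/ganesha.conf
cat >> "$conf" <<'EOF'
NFS_Core_Param {
        # Use a non-privileged port for RQuota
        Rquota_Port = 4501;
}
EOF
echo "appended NFS_Core_Param block to $conf"

# Then, as root on each node:
#   rm -rf /var/cache/rpcbind/*
#   systemctl restart rpcbind nfs-ganesha
```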
>> >> Thanks, >> Soumya >> >>> >>> Thanks, >>> >>> Alessandro >>> >>> >>> >>> >>>> Il giorno 09/giu/2015, alle ore 18:37, Soumya Koduri <skoduri at redhat.com> ha scritto: >>>> >>>> >>>> >>>> On 06/09/2015 09:47 PM, Alessandro De Salvo wrote: >>>>> Another update: the fact that I was unable to use vol set ganesha.enable >>>>> was due to another bug in the ganesha scripts. In short, they are all >>>>> using the following line to get the location of the conf file: >>>>> >>>>> CONF=$(cat /etc/sysconfig/ganesha | grep "CONFFILE" | cut -f 2 -d "=") >>>>> >>>>> First of all by default in /etc/sysconfig/ganesha there is no line >>>>> CONFFILE, second there is a bug in that directive, as it works if I add >>>>> in /etc/sysconfig/ganesha >>>>> >>>>> CONFFILE=/etc/ganesha/ganesha.conf >>>>> >>>>> but it fails if the same is quoted >>>>> >>>>> CONFFILE="/etc/ganesha/ganesha.conf" >>>>> >>>>> It would be much better to use the following, which has a default as >>>>> well: >>>>> >>>>> eval $(grep -F CONFFILE= /etc/sysconfig/ganesha) >>>>> CONF=${CONFFILE:-/etc/ganesha/ganesha.conf} >>>>> >>>>> I'll update the bug report. >>>>> Having said this... the last issue to tackle is the real problem with >>>>> the ganesha.nfsd :-( >>>> >>>> Thanks. Could you try changing log level to NIV_FULL_DEBUG in '/etc/sysconfig/ganesha' and check if anything gets logged in '/var/log/ganesha.log' or '/ganesha.log'. >>>> >>>> Thanks, >>>> Soumya >>>> >>>>> Cheers, >>>>> >>>>> Alessandro >>>>> >>>>> >>>>> On Tue, 2015-06-09 at 14:25 +0200, Alessandro De Salvo wrote: >>>>>> OK, I can confirm that the ganesha.nfsd process is actually not >>>>>> answering to the calls. 
Here it is what I see: >>>>>> >>>>>> # rpcinfo -p >>>>>> program vers proto port service >>>>>> 100000 4 tcp 111 portmapper >>>>>> 100000 3 tcp 111 portmapper >>>>>> 100000 2 tcp 111 portmapper >>>>>> 100000 4 udp 111 portmapper >>>>>> 100000 3 udp 111 portmapper >>>>>> 100000 2 udp 111 portmapper >>>>>> 100024 1 udp 41594 status >>>>>> 100024 1 tcp 53631 status >>>>>> 100003 3 udp 2049 nfs >>>>>> 100003 3 tcp 2049 nfs >>>>>> 100003 4 udp 2049 nfs >>>>>> 100003 4 tcp 2049 nfs >>>>>> 100005 1 udp 58127 mountd >>>>>> 100005 1 tcp 56301 mountd >>>>>> 100005 3 udp 58127 mountd >>>>>> 100005 3 tcp 56301 mountd >>>>>> 100021 4 udp 46203 nlockmgr >>>>>> 100021 4 tcp 41798 nlockmgr >>>>>> 100011 1 udp 875 rquotad >>>>>> 100011 1 tcp 875 rquotad >>>>>> 100011 2 udp 875 rquotad >>>>>> 100011 2 tcp 875 rquotad >>>>>> >>>>>> # netstat -lpn | grep ganesha >>>>>> tcp6 14 0 :::2049 :::* >>>>>> LISTEN 11937/ganesha.nfsd >>>>>> tcp6 0 0 :::41798 :::* >>>>>> LISTEN 11937/ganesha.nfsd >>>>>> tcp6 0 0 :::875 :::* >>>>>> LISTEN 11937/ganesha.nfsd >>>>>> tcp6 10 0 :::56301 :::* >>>>>> LISTEN 11937/ganesha.nfsd >>>>>> tcp6 0 0 :::564 :::* >>>>>> LISTEN 11937/ganesha.nfsd >>>>>> udp6 0 0 :::2049 :::* >>>>>> 11937/ganesha.nfsd >>>>>> udp6 0 0 :::46203 :::* >>>>>> 11937/ganesha.nfsd >>>>>> udp6 0 0 :::58127 :::* >>>>>> 11937/ganesha.nfsd >>>>>> udp6 0 0 :::875 :::* >>>>>> 11937/ganesha.nfsd >>>>>> >>>>>> I'm attaching the strace of a showmount from a node to the other. >>>>>> This machinery was working with nfs-ganesha 2.1.0, so it must be >>>>>> something introduced with 2.2.0. 
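The quote-tolerant CONFFILE lookup suggested a couple of messages up can be sketched as a self-contained snippet. Note the `:-` (use-default) expansion: the plain `${CONFFILE:/path}` form is a bash substring expansion, not a default. A scratch file stands in for /etc/sysconfig/ganesha, seeded with the quoted value that broke the original `cut`-based parsing:

```shell
# Quote-tolerant lookup of CONFFILE with a sane default.
sysconfig=$(mktemp)   # stands in for /etc/sysconfig/ganesha
printf 'CONFFILE="/etc/ganesha/ganesha.conf"\n' > "$sysconfig"

unset CONFFILE
# eval lets the shell itself strip any quoting around the value.
[ -f "$sysconfig" ] && eval "$(grep -F 'CONFFILE=' "$sysconfig")"
CONF=${CONFFILE:-/etc/ganesha/ganesha.conf}
echo "CONF=$CONF"
```

The same lines work whether the sysconfig value is quoted, unquoted, or missing entirely (the default then applies).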
>>>>>> Cheers, >>>>>> >>>>>> Alessandro >>>>>> >>>>>> >>>>>> >>>>>> On Tue, 2015-06-09 at 15:16 +0530, Soumya Koduri wrote: >>>>>>> >>>>>>> On 06/09/2015 02:48 PM, Alessandro De Salvo wrote: >>>>>>>> Hi, >>>>>>>> OK, the problem with the VIPs not starting is due to the ganesha_mon >>>>>>>> heartbeat script looking for a pid file called >>>>>>>> /var/run/ganesha.nfsd.pid, while by default ganesha.nfsd v.2.2.0 is >>>>>>>> creating /var/run/ganesha.pid, this needs to be corrected. The file is >>>>>>>> in glusterfs-ganesha-3.7.1-1.el7.x86_64, in my case. >>>>>>>> For the moment I have created a symlink in this way and it works: >>>>>>>> >>>>>>>> ln -s /var/run/ganesha.pid /var/run/ganesha.nfsd.pid >>>>>>>> >>>>>>> Thanks. Please update this as well in the bug. >>>>>>> >>>>>>>> So far so good, the VIPs are up and pingable, but still there is the >>>>>>>> problem of the hanging showmount (i.e. hanging RPC). >>>>>>>> Still, I see a lot of errors like this in /var/log/messages: >>>>>>>> >>>>>>>> Jun 9 11:15:20 atlas-node1 lrmd[31221]: notice: operation_finished: >>>>>>>> nfs-mon_monitor_10000:29292:stderr [ Error: Resource does not exist. ] >>>>>>>> >>>>>>>> While ganesha.log shows the server is not in grace: >>>>>>>> >>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29964[main] main :MAIN :EVENT :ganesha.nfsd Starting: >>>>>>>> Ganesha Version /builddir/build/BUILD/nfs-ganesha-2.2.0/src, built at >>>>>>>> May 18 2015 14:17:18 on buildhw-09.phx2.fedoraproject.org >>>>>>>> <http://buildhw-09.phx2.fedoraproject.org> >>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_set_param_from_conf :NFS STARTUP :EVENT >>>>>>>> :Configuration file successfully parsed >>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT >>>>>>>> :Initializing ID Mapper. 
>>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper >>>>>>>> successfully initialized. >>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] main :NFS STARTUP :WARN :No export entries >>>>>>>> found in configuration file !!! >>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] config_errs_to_log :CONFIG :WARN :Config File >>>>>>>> ((null):0): Empty configuration file >>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT >>>>>>>> :CAP_SYS_RESOURCE was successfully removed for proper quota management >>>>>>>> in FSAL >>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT :currenty set >>>>>>>> capabilities are: >>>>>>>> cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep >>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_Init_svc :DISP :CRIT :Cannot acquire >>>>>>>> credentials for principal nfs >>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin >>>>>>>> thread initialized >>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs4_start_grace :STATE :EVENT :NFS Server Now >>>>>>>> IN GRACE, duration 60 >>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >>>>>>>> 
ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT >>>>>>>> :Callback creds directory (/var/run/ganesha) already exists >>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN >>>>>>>> :gssd_refresh_krb5_machine_credential failed (2:2) >>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :Starting >>>>>>>> delayed executor. >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :9P/TCP >>>>>>>> dispatcher thread was started successfully >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[_9p_disp] _9p_dispatcher_thread :9P DISP :EVENT :9P >>>>>>>> dispatcher started >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT >>>>>>>> :gsh_dbusthread was started successfully >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :admin thread >>>>>>>> was started successfully >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :reaper thread >>>>>>>> was started successfully >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN >>>>>>>> GRACE >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :General >>>>>>>> fridge was started successfully >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT >>>>>>>> :------------------------------------------------- >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> 
ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT : NFS >>>>>>>> SERVER INITIALIZED >>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT >>>>>>>> :------------------------------------------------- >>>>>>>> 09/06/2015 11:17:22 : epoch 5576aee4 : atlas-node1 : >>>>>>>> ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now >>>>>>>> NOT IN GRACE >>>>>>>> >>>>>>>> >>>>>>> Please check the status of nfs-ganesha >>>>>>> $service nfs-ganesha status >>>>>>> >>>>>>> Could you try taking a packet trace (during showmount or mount) and >>>>>>> check the server responses. >>>>>>> >>>>>>> Thanks, >>>>>>> Soumya >>>>>>> >>>>>>>> Cheers, >>>>>>>> >>>>>>>> Alessandro >>>>>>>> >>>>>>>> >>>>>>>>> Il giorno 09/giu/2015, alle ore 10:36, Alessandro De Salvo >>>>>>>>> <alessandro.desalvo at roma1.infn.it >>>>>>>>> <mailto:alessandro.desalvo at roma1.infn.it>> ha scritto: >>>>>>>>> >>>>>>>>> Hi Soumya, >>>>>>>>> >>>>>>>>>> Il giorno 09/giu/2015, alle ore 08:06, Soumya Koduri >>>>>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 06/09/2015 01:31 AM, Alessandro De Salvo wrote: >>>>>>>>>>> OK, I found at least one of the bugs. >>>>>>>>>>> The /usr/libexec/ganesha/ganesha.sh has the following lines: >>>>>>>>>>> >>>>>>>>>>> if [ -e /etc/os-release ]; then >>>>>>>>>>> RHEL6_PCS_CNAME_OPTION="" >>>>>>>>>>> fi >>>>>>>>>>> >>>>>>>>>>> This is OK for RHEL < 7, but does not work for >= 7. I have changed >>>>>>>>>>> it to the following, to make it working: >>>>>>>>>>> >>>>>>>>>>> if [ -e /etc/os-release ]; then >>>>>>>>>>> eval $(grep -F "REDHAT_SUPPORT_PRODUCT=" /etc/os-release) >>>>>>>>>>> [ "$REDHAT_SUPPORT_PRODUCT" == "Fedora" ] && >>>>>>>>>>> RHEL6_PCS_CNAME_OPTION="" >>>>>>>>>>> fi >>>>>>>>>>> >>>>>>>>>> Oh..Thanks for the fix. Could you please file a bug for the same (and >>>>>>>>>> probably submit your fix as well). 
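Made self-contained, the os-release check from that ganesha.sh fix looks like the sketch below. The `--name` default is an assumption about the EL6-era value the HA script uses, and a scratch file stands in for /etc/os-release; the Fedora-only match mirrors the fix quoted above:

```shell
# Decide the pcs cluster-name option based on os-release contents.
os_release=$(mktemp)   # stands in for /etc/os-release
printf 'REDHAT_SUPPORT_PRODUCT="Fedora"\n' > "$os_release"

RHEL6_PCS_CNAME_OPTION="--name"   # assumed default for older pcs
if [ -e "$os_release" ]; then
    eval "$(grep -F 'REDHAT_SUPPORT_PRODUCT=' "$os_release")"
    if [ "$REDHAT_SUPPORT_PRODUCT" = "Fedora" ]; then
        RHEL6_PCS_CNAME_OPTION=""
    fi
fi
echo "RHEL6_PCS_CNAME_OPTION='$RHEL6_PCS_CNAME_OPTION'"
```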
We shall have it corrected. >>>>>>>>> >>>>>>>>> Just did it,https://bugzilla.redhat.com/show_bug.cgi?id=1229601 >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Apart from that, the VIP_<node> I was using were wrong, and I should >>>>>>>>>>> have converted all the ?-? to underscores, maybe this could be >>>>>>>>>>> mentioned in the documentation when you will have it ready. >>>>>>>>>>> Now, the cluster starts, but the VIPs apparently not: >>>>>>>>>>> >>>>>>>>>> Sure. Thanks again for pointing it out. We shall make a note of it. >>>>>>>>>> >>>>>>>>>>> Online: [ atlas-node1 atlas-node2 ] >>>>>>>>>>> >>>>>>>>>>> Full list of resources: >>>>>>>>>>> >>>>>>>>>>> Clone Set: nfs-mon-clone [nfs-mon] >>>>>>>>>>> Started: [ atlas-node1 atlas-node2 ] >>>>>>>>>>> Clone Set: nfs-grace-clone [nfs-grace] >>>>>>>>>>> Started: [ atlas-node1 atlas-node2 ] >>>>>>>>>>> atlas-node1-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >>>>>>>>>>> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>>>>>>>>>> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >>>>>>>>>>> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>>>>>>>>>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>>>>>>>>>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>>>>>>>>>> >>>>>>>>>>> PCSD Status: >>>>>>>>>>> atlas-node1: Online >>>>>>>>>>> atlas-node2: Online >>>>>>>>>>> >>>>>>>>>>> Daemon Status: >>>>>>>>>>> corosync: active/disabled >>>>>>>>>>> pacemaker: active/disabled >>>>>>>>>>> pcsd: active/enabled >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> Here corosync and pacemaker shows 'disabled' state. Can you check the >>>>>>>>>> status of their services. They should be running prior to cluster >>>>>>>>>> creation. We need to include that step in document as well. 
>>>>>>>>> Ah, OK, you're right, I have added it to my puppet modules (we install >>>>>>>>> and configure ganesha via puppet, I'll put the module on puppetforge >>>>>>>>> soon, in case anyone is interested). >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> But the issue that is puzzling me more is the following: >>>>>>>>>>> >>>>>>>>>>> # showmount -e localhost >>>>>>>>>>> rpc mount export: RPC: Timed out >>>>>>>>>>> >>>>>>>>>>> And when I try to enable the ganesha exports on a volume I get this >>>>>>>>>>> error: >>>>>>>>>>> >>>>>>>>>>> # gluster volume set atlas-home-01 ganesha.enable on >>>>>>>>>>> volume set: failed: Failed to create NFS-Ganesha export config file. >>>>>>>>>>> >>>>>>>>>>> But I see the file created in /etc/ganesha/exports/*.conf >>>>>>>>>>> Still, showmount hangs and times out. >>>>>>>>>>> Any help? >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>> Hmm that's strange. Sometimes, if there was no proper cleanup >>>>>>>>>> done while trying to re-create the cluster, we have seen such issues. >>>>>>>>>> >>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1227709 >>>>>>>>>> >>>>>>>>>> http://review.gluster.org/#/c/11093/ >>>>>>>>>> >>>>>>>>>> Can you please unexport all the volumes, teardown the cluster using >>>>>>>>>> 'gluster vol set <volname> ganesha.enable off' >>>>>>>>> >>>>>>>>> OK: >>>>>>>>> >>>>>>>>> # gluster vol set atlas-home-01 ganesha.enable off >>>>>>>>> volume set: failed: ganesha.enable is already 'off'. >>>>>>>>> >>>>>>>>> # gluster vol set atlas-data-01 ganesha.enable off >>>>>>>>> volume set: failed: ganesha.enable is already 'off'. >>>>>>>>> >>>>>>>>> >>>>>>>>>> 'gluster ganesha disable' command. >>>>>>>>> >>>>>>>>> I'm assuming you wanted to write nfs-ganesha instead? >>>>>>>>> >>>>>>>>> # gluster nfs-ganesha disable >>>>>>>>> ganesha enable : success >>>>>>>>> >>>>>>>>> >>>>>>>>> A side note (not really important): it's strange that when I do a >>>>>>>>> disable the message is 'ganesha enable' 
:-) >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Verify if the following files have been deleted on all the nodes- >>>>>>>>>> '/etc/cluster/cluster.conf' >>>>>>>>> >>>>>>>>> this file is not present at all, I think it's not needed in CentOS 7 >>>>>>>>> >>>>>>>>>> '/etc/ganesha/ganesha.conf', >>>>>>>>> >>>>>>>>> it's still there, but empty, and I guess it should be OK, right? >>>>>>>>> >>>>>>>>>> '/etc/ganesha/exports/*' >>>>>>>>> >>>>>>>>> no more files there >>>>>>>>> >>>>>>>>>> '/var/lib/pacemaker/cib' >>>>>>>>> >>>>>>>>> it's empty >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Verify if the ganesha service is stopped on all the nodes. >>>>>>>>> >>>>>>>>> nope, it's still running, I will stop it. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> start/restart the services - corosync, pcs. >>>>>>>>> >>>>>>>>> In the node where I issued the nfs-ganesha disable there is no more >>>>>>>>> any /etc/corosync/corosync.conf so corosync won't start. The other >>>>>>>>> node instead still has the file, it's strange. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> And re-try the HA cluster creation >>>>>>>>>> 'gluster ganesha enable' 
>>>>>>>>> >>>>>>>>> This time (repeated twice) it did not work at all: >>>>>>>>> >>>>>>>>> # pcs status >>>>>>>>> Cluster name: ATLAS_GANESHA_01 >>>>>>>>> Last updated: Tue Jun 9 10:13:43 2015 >>>>>>>>> Last change: Tue Jun 9 10:13:22 2015 >>>>>>>>> Stack: corosync >>>>>>>>> Current DC: atlas-node1 (1) - partition with quorum >>>>>>>>> Version: 1.1.12-a14efad >>>>>>>>> 2 Nodes configured >>>>>>>>> 6 Resources configured >>>>>>>>> >>>>>>>>> >>>>>>>>> Online: [ atlas-node1 atlas-node2 ] >>>>>>>>> >>>>>>>>> Full list of resources: >>>>>>>>> >>>>>>>>> Clone Set: nfs-mon-clone [nfs-mon] >>>>>>>>> Started: [ atlas-node1 atlas-node2 ] >>>>>>>>> Clone Set: nfs-grace-clone [nfs-grace] >>>>>>>>> Started: [ atlas-node1 atlas-node2 ] >>>>>>>>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>>>>>>>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>>>>>>>> >>>>>>>>> PCSD Status: >>>>>>>>> atlas-node1: Online >>>>>>>>> atlas-node2: Online >>>>>>>>> >>>>>>>>> Daemon Status: >>>>>>>>> corosync: active/enabled >>>>>>>>> pacemaker: active/enabled >>>>>>>>> pcsd: active/enabled >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I tried then "pcs cluster destroy" on both nodes, and then again >>>>>>>>> nfs-ganesha enable, but now I?m back to the old problem: >>>>>>>>> >>>>>>>>> # pcs status >>>>>>>>> Cluster name: ATLAS_GANESHA_01 >>>>>>>>> Last updated: Tue Jun 9 10:22:27 2015 >>>>>>>>> Last change: Tue Jun 9 10:17:00 2015 >>>>>>>>> Stack: corosync >>>>>>>>> Current DC: atlas-node2 (2) - partition with quorum >>>>>>>>> Version: 1.1.12-a14efad >>>>>>>>> 2 Nodes configured >>>>>>>>> 10 Resources configured >>>>>>>>> >>>>>>>>> >>>>>>>>> Online: [ atlas-node1 atlas-node2 ] >>>>>>>>> >>>>>>>>> Full list of resources: >>>>>>>>> >>>>>>>>> Clone Set: nfs-mon-clone [nfs-mon] >>>>>>>>> Started: [ atlas-node1 atlas-node2 ] >>>>>>>>> Clone Set: nfs-grace-clone [nfs-grace] >>>>>>>>> Started: [ atlas-node1 atlas-node2 ] >>>>>>>>> atlas-node1-cluster_ip-1 
(ocf::heartbeat:IPaddr): Stopped >>>>>>>>> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>>>>>>>> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >>>>>>>>> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>>>>>>>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>>>>>>>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>>>>>>>> >>>>>>>>> PCSD Status: >>>>>>>>> atlas-node1: Online >>>>>>>>> atlas-node2: Online >>>>>>>>> >>>>>>>>> Daemon Status: >>>>>>>>> corosync: active/enabled >>>>>>>>> pacemaker: active/enabled >>>>>>>>> pcsd: active/enabled >>>>>>>>> >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> >>>>>>>>> Alessandro >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Soumya >>>>>>>>>> >>>>>>>>>>> Alessandro >>>>>>>>>>> >>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 20:00, Alessandro De Salvo >>>>>>>>>>>> <Alessandro.DeSalvo at roma1.infn.it >>>>>>>>>>>> <mailto:Alessandro.DeSalvo at roma1.infn.it>> ha scritto: >>>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> indeed, it does not work :-) >>>>>>>>>>>> OK, this is what I did, with 2 machines, running CentOS 7.1, >>>>>>>>>>>> Glusterfs 3.7.1 and nfs-ganesha 2.2.0: >>>>>>>>>>>> >>>>>>>>>>>> 1) ensured that the machines are able to resolve their IPs (but >>>>>>>>>>>> this was already true since they were in the DNS); >>>>>>>>>>>> 2) disabled NetworkManager and enabled network on both machines; >>>>>>>>>>>> 3) created a gluster shared volume 'gluster_shared_storage' and >>>>>>>>>>>> mounted it on '/run/gluster/shared_storage' on all the cluster >>>>>>>>>>>> nodes using glusterfs native mount (on CentOS 7.1 there is a link >>>>>>>>>>>> by default /var/run -> ../run) >>>>>>>>>>>> 4) created an empty /etc/ganesha/ganesha.conf; >>>>>>>>>>>> 5) installed pacemaker pcs resource-agents corosync on all cluster >>>>>>>>>>>> machines; >>>>>>>>>>>> 6) set the 'hacluster' 
user the same password on all machines; >>>>>>>>>>>> 7) pcs cluster auth <hostname> -u hacluster -p <pass> on all the >>>>>>>>>>>> nodes (on both nodes I issued the commands for both nodes) >>>>>>>>>>>> 8) IPv6 is configured by default on all nodes, although the >>>>>>>>>>>> infrastructure is not ready for IPv6 >>>>>>>>>>>> 9) enabled pcsd and started it on all nodes >>>>>>>>>>>> 10) populated /etc/ganesha/ganesha-ha.conf with the following >>>>>>>>>>>> contents, one per machine: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> ===> atlas-node1 >>>>>>>>>>>> # Name of the HA cluster created. >>>>>>>>>>>> HA_NAME="ATLAS_GANESHA_01" >>>>>>>>>>>> # The server from which you intend to mount >>>>>>>>>>>> # the shared volume. >>>>>>>>>>>> HA_VOL_SERVER="atlas-node1" >>>>>>>>>>>> # The subset of nodes of the Gluster Trusted Pool >>>>>>>>>>>> # that forms the ganesha HA cluster. IP/Hostname >>>>>>>>>>>> # is specified. >>>>>>>>>>>> HA_CLUSTER_NODES="atlas-node1,atlas-node2" >>>>>>>>>>>> # Virtual IPs of each of the nodes specified above. >>>>>>>>>>>> VIP_atlas-node1="x.x.x.1" >>>>>>>>>>>> VIP_atlas-node2="x.x.x.2" >>>>>>>>>>>> >>>>>>>>>>>> ===> atlas-node2 >>>>>>>>>>>> # Name of the HA cluster created. >>>>>>>>>>>> HA_NAME="ATLAS_GANESHA_01" >>>>>>>>>>>> # The server from which you intend to mount >>>>>>>>>>>> # the shared volume. >>>>>>>>>>>> HA_VOL_SERVER="atlas-node2" >>>>>>>>>>>> # The subset of nodes of the Gluster Trusted Pool >>>>>>>>>>>> # that forms the ganesha HA cluster. IP/Hostname >>>>>>>>>>>> # is specified. >>>>>>>>>>>> HA_CLUSTER_NODES="atlas-node1,atlas-node2" >>>>>>>>>>>> # Virtual IPs of each of the nodes specified above. >>>>>>>>>>>> VIP_atlas-node1="x.x.x.1" >>>>>>>>>>>> VIP_atlas-node2="x.x.x.2" 
>>>>>>>>>>>> >>>>>>>>>>>> 11) issued gluster nfs-ganesha enable, but it fails with a cryptic >>>>>>>>>>>> message: >>>>>>>>>>>> >>>>>>>>>>>> # gluster nfs-ganesha enable >>>>>>>>>>>> Enabling NFS-Ganesha requires Gluster-NFS to be disabled across the >>>>>>>>>>>> trusted pool. Do you still want to continue? (y/n) y >>>>>>>>>>>> nfs-ganesha: failed: Failed to set up HA config for NFS-Ganesha. >>>>>>>>>>>> Please check the log file for details >>>>>>>>>>>> >>>>>>>>>>>> Looking at the logs I found nothing really special but this: >>>>>>>>>>>> >>>>>>>>>>>> ==> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log <=>>>>>>>>>>>> [2015-06-08 17:57:15.672844] I [MSGID: 106132] >>>>>>>>>>>> [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs >>>>>>>>>>>> already stopped >>>>>>>>>>>> [2015-06-08 17:57:15.675395] I >>>>>>>>>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host >>>>>>>>>>>> found Hostname is atlas-node2 >>>>>>>>>>>> [2015-06-08 17:57:15.720692] I >>>>>>>>>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host >>>>>>>>>>>> found Hostname is atlas-node2 >>>>>>>>>>>> [2015-06-08 17:57:15.721161] I >>>>>>>>>>>> [glusterd-ganesha.c:335:is_ganesha_host] 0-management: ganesha host >>>>>>>>>>>> found Hostname is atlas-node2 >>>>>>>>>>>> [2015-06-08 17:57:16.633048] E >>>>>>>>>>>> [glusterd-ganesha.c:254:glusterd_op_set_ganesha] 0-management: >>>>>>>>>>>> Initial NFS-Ganesha set up failed >>>>>>>>>>>> [2015-06-08 17:57:16.641563] E >>>>>>>>>>>> [glusterd-syncop.c:1396:gd_commit_op_phase] 0-management: Commit of >>>>>>>>>>>> operation 'Volume (null)' failed on localhost : Failed to set up HA >>>>>>>>>>>> config for NFS-Ganesha. Please check the log file for details >>>>>>>>>>>> >>>>>>>>>>>> ==> /var/log/glusterfs/cmd_history.log <=>>>>>>>>>>>> [2015-06-08 17:57:16.643615] : nfs-ganesha enable : FAILED : >>>>>>>>>>>> Failed to set up HA config for NFS-Ganesha. 
Please check the log >>>>>>>>>>>> file for details >>>>>>>>>>>> >>>>>>>>>>>> ==> /var/log/glusterfs/cli.log <=>>>>>>>>>>>> [2015-06-08 17:57:16.643839] I [input.c:36:cli_batch] 0-: Exiting >>>>>>>>>>>> with: -1 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Also, pcs seems to be fine for the auth part, although it obviously >>>>>>>>>>>> tells me the cluster is not running. >>>>>>>>>>>> >>>>>>>>>>>> I, [2015-06-08T19:57:16.305323 #7223] INFO -- : Running: >>>>>>>>>>>> /usr/sbin/corosync-cmapctl totem.cluster_name >>>>>>>>>>>> I, [2015-06-08T19:57:16.345457 #7223] INFO -- : Running: >>>>>>>>>>>> /usr/sbin/pcs cluster token-nodes >>>>>>>>>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET >>>>>>>>>>>> /remote/check_auth HTTP/1.1" 200 68 0.1919 >>>>>>>>>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET >>>>>>>>>>>> /remote/check_auth HTTP/1.1" 200 68 0.1920 >>>>>>>>>>>> atlas-node1.mydomain - - [08/Jun/2015:19:57:16 CEST] "GET >>>>>>>>>>>> /remote/check_auth HTTP/1.1" 200 68 >>>>>>>>>>>> - -> /remote/check_auth >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> What am I doing wrong? >>>>>>>>>>>> Thanks, >>>>>>>>>>>> >>>>>>>>>>>> Alessandro >>>>>>>>>>>> >>>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 19:30, Soumya Koduri >>>>>>>>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On 06/08/2015 08:20 PM, Alessandro De Salvo wrote: >>>>>>>>>>>>>> Sorry, just another question: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - in my installation of gluster 3.7.1 the command gluster >>>>>>>>>>>>>> features.ganesha enable does not work: >>>>>>>>>>>>>> >>>>>>>>>>>>>> # gluster features.ganesha enable >>>>>>>>>>>>>> unrecognized word: features.ganesha (position 0) >>>>>>>>>>>>>> >>>>>>>>>>>>>> Which version has full support for it? >>>>>>>>>>>>> >>>>>>>>>>>>> Sorry. This option has recently been changed. 
It is now >>>>>>>>>>>>> >>>>>>>>>>>>> $ gluster nfs-ganesha enable >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> - in the documentation the ccs and cman packages are required, >>>>>>>>>>>>>> but they seems not to be available anymore on CentOS 7 and >>>>>>>>>>>>>> similar, I guess they are not really required anymore, as pcs >>>>>>>>>>>>>> should do the full job >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Alessandro >>>>>>>>>>>>> >>>>>>>>>>>>> Looks like so from http://clusterlabs.org/quickstart-redhat.html. >>>>>>>>>>>>> Let us know if it doesn't work. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Soumya >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 15:09, Alessandro De Salvo >>>>>>>>>>>>>>> <alessandro.desalvo at roma1.infn.it >>>>>>>>>>>>>>> <mailto:alessandro.desalvo at roma1.infn.it>> ha scritto: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Great, many thanks Soumya! >>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Alessandro >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 13:53, Soumya Koduri >>>>>>>>>>>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Please find the slides of the demo video at [1] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We recommend to have a distributed replica volume as a shared >>>>>>>>>>>>>>>> volume for better data-availability. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Size of the volume depends on the workload you may have. Since >>>>>>>>>>>>>>>> it is used to maintain states of NLM/NFSv4 clients, you may >>>>>>>>>>>>>>>> calculate the size of the volume to be minimum of aggregate of >>>>>>>>>>>>>>>> (typical_size_of'/var/lib/nfs'_directory + >>>>>>>>>>>>>>>> ~4k*no_of_clients_connected_to_each_of_the_nfs_servers_at_any_point) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We shall document about this feature sooner in the gluster docs >>>>>>>>>>>>>>>> as well. 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 1770 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20150611/c92cc441/attachment.p7s>
Soumya Koduri
2015-Jun-11 16:16 UTC
[Gluster-users] Questions on ganesha HA and shared storage size
CCing ganesha-devel to get more inputs.

When IPv6 is enabled, only v6 interfaces are used by NFS-Ganesha. Commit
d7e8f255 ('git show d7e8f255'), which was added in v2.2, has more details.

> # netstat -ltaupn | grep 2049
> tcp6       4      0 :::2049           :::*              LISTEN      32080/ganesha.nfsd
> tcp6       1      0 x.x.x.2:2049      x.x.x.2:33285     CLOSE_WAIT  -
> tcp6       1      0 127.0.0.1:2049    127.0.0.1:39555   CLOSE_WAIT  -
> udp6       0      0 :::2049           :::*                          32080/ganesha.nfsd

Looks like (from both the logs and the netstat output) there was a shutdown
request even before the server had come out of the grace period:

10/06/2015 01:58:53 : epoch 55777da1 : node2 : ganesha.nfsd-20696[work-6] nfs_rpc_dequeue_req :DISP :F_DBG :dequeue_req try qpair REQ_Q_LOW_LATENCY 0x7fdf8dc67b00:0x7fdf8dc67b68
10/06/2015 01:58:53 : epoch 55777da1 : node2 : ganesha.nfsd-20696[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
......
10/06/2015 01:58:55 : epoch 55777da1 : node2 : ganesha.nfsd-20696[dbus_heartbeat] gsh_dbus_thread :DBUS :F_DBG :top of poll loop
10/06/2015 01:58:55 : epoch 55777da1 : node2 : ganesha.nfsd-20696[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
10/06/2015 01:58:55 : epoch 55777da1 : node2 : ganesha.nfsd-20696[work-12] nfs_rpc_consume_req :DISP :F_DBG :try splice, qpair REQ_Q_LOW_LATENCY consumer qsize=0 producer qsize=0
......
10/06/2015 01:59:52 : epoch 55777da1 : node2 : ganesha.nfsd-20696[dbus_heartbeat] gsh_dbus_thread :DBUS :F_DBG :top of poll loop
10/06/2015 01:59:52 : epoch 55777da1 : node2 : ganesha.nfsd-20696[Admin] do_shutdown :MAIN :EVENT :NFS EXIT: stopping NFS service
.......
10/06/2015 02:00:00 : epoch 55777da1 : node2 : ganesha.nfsd-20696[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
10/06/2015 02:00:00 : epoch 55777da1 : node2 : ganesha.nfsd-20696[dbus_heartbeat] gsh_dbus_thread :DBUS :F_DBG :top of poll loop

When you observe the hang, please take 'gstack <ganesha_pid>' output and
post it in the mail.
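[Editorial note on the "tcp6 with an IPv4 address" question discussed in this thread: on Linux, an AF_INET6 wildcard listener with IPV6_V6ONLY disabled also serves IPv4 clients, and those peers are reported as IPv4-mapped IPv6 addresses, so netstat lists the socket under tcp6 even for IPv4 traffic. A minimal, generic sketch of that behaviour (plain socket code, not NFS-Ganesha's):]

```python
import socket

# Dual-stack listener: AF_INET6 wildcard with IPV6_V6ONLY disabled,
# analogous to ganesha's ":::2049" socket in the netstat output.
srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
srv.bind(("::", 0))               # port 0: let the kernel pick a free port
srv.listen(1)
port = srv.getsockname()[1]

# A plain IPv4 client connecting to the "tcp6" listener.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
conn, peer = srv.accept()
print(peer[0])                    # IPv4-mapped form, e.g. ::ffff:127.0.0.1
conn.close(); cli.close(); srv.close()
```

So a "tcp6" line whose endpoints look like IPv4 addresses does not by itself mean the RPCs are going over IPv6; it can simply be IPv4 traffic on a dual-stack socket.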
Thanks,
Soumya

On 06/11/2015 12:37 AM, Alessandro De Salvo wrote:
> Hi,
> by looking at the connections I also see a strange problem:
>
> # netstat -ltaupn | grep 2049
> tcp6       4      0 :::2049           :::*              LISTEN      32080/ganesha.nfsd
> tcp6       1      0 x.x.x.2:2049      x.x.x.2:33285     CLOSE_WAIT  -
> tcp6       1      0 127.0.0.1:2049    127.0.0.1:39555   CLOSE_WAIT  -
> udp6       0      0 :::2049           :::*                          32080/ganesha.nfsd
>
> Why tcp6 is used with an IPv4 address?
> In another machine where ganesha 2.1.0 is running I see tcp is used, not
> tcp6.
> Could it be that the RPC are always trying to use IPv6? That would be
> wrong.
> Thanks,
>
> Alessandro
>
> On Wed, 2015-06-10 at 15:28 +0530, Soumya Koduri wrote:
>>
>> On 06/10/2015 05:49 AM, Alessandro De Salvo wrote:
>>> Hi,
>>> I have enabled the full debug already, but I see nothing special. Before
>>> exporting any volume the log shows no error, even when I do a showmount
>>> (the log is attached, ganesha.log.gz). If I do the same after exporting
>>> a volume nfs-ganesha does not even start, complaining for not being able
>>> to bind the IPv6 rquota socket, but in fact there is nothing listening
>>> on IPv6, so it should not happen:
>>>
>>> tcp6       0      0 :::111                   :::*      LISTEN      7433/rpcbind
>>> tcp6       0      0 :::2224                  :::*      LISTEN      9054/ruby
>>> tcp6       0      0 :::22                    :::*      LISTEN      1248/sshd
>>> udp6       0      0 :::111                   :::*                  7433/rpcbind
>>> udp6       0      0 fe80::8c2:27ff:fef2:123  :::*                  31238/ntpd
>>> udp6       0      0 fe80::230:48ff:fed2:123  :::*                  31238/ntpd
>>> udp6       0      0 fe80::230:48ff:fed2:123  :::*                  31238/ntpd
>>> udp6       0      0 fe80::230:48ff:fed2:123  :::*                  31238/ntpd
>>> udp6       0      0 ::1:123                  :::*                  31238/ntpd
>>> udp6       0      0 fe80::5484:7aff:fef:123  :::*                  31238/ntpd
>>> udp6       0      0 :::123                   :::*                  31238/ntpd
>>> udp6       0      0 :::824                   :::*                  7433/rpcbind
>>>
>>> The error, as shown in the attached ganesha-after-export.log.gz logfile,
>>> is the following:
>>>
>>> 10/06/2015 02:07:47 : epoch 55777fb5 : node2 : ganesha.nfsd-26195[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use)
>>> 10/06/2015 02:07:47 : epoch 55777fb5 : node2 : ganesha.nfsd-26195[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue.
>>> 10/06/2015 02:07:48 : epoch 55777fb5 : node2 : ganesha.nfsd-26195[main] glusterfs_unload :FSAL :DEBUG :FSAL Gluster unloaded
>>>
>>
>> We have seen such issues with RPCBIND few times. NFS-Ganesha setup first
>> disables Gluster-NFS and then brings up NFS-Ganesha service. Sometimes,
>> there could be delay or issue with Gluster-NFS un-registering those
>> services and when NFS-Ganesha tries to register to the same port, it
>> throws this error. Please try registering Rquota to any random port
>> using below config option in "/etc/ganesha/ganesha.conf"
>>
>> NFS_Core_Param {
>>         #Use a non-privileged port for RQuota
>>         Rquota_Port = 4501;
>> }
>>
>> and cleanup '/var/cache/rpcbind/' directory before the setup.
>>
>> Thanks,
>> Soumya
>>
>>> Thanks,
>>>
>>> Alessandro
>>>
>>>> Il giorno 09/giu/2015, alle ore 18:37, Soumya Koduri <skoduri at redhat.com> ha scritto:
>>>>
>>>> On 06/09/2015 09:47 PM, Alessandro De Salvo wrote:
>>>>> Another update: the fact that I was unable to use vol set ganesha.enable
>>>>> was due to another bug in the ganesha scripts. In short, they are all
>>>>> using the following line to get the location of the conf file:
>>>>>
>>>>> CONF=$(cat /etc/sysconfig/ganesha | grep "CONFFILE" | cut -f 2 -d "=")
>>>>>
>>>>> First of all by default in /etc/sysconfig/ganesha there is no line
>>>>> CONFFILE, second there is a bug in that directive, as it works if I add
>>>>> in /etc/sysconfig/ganesha
>>>>>
>>>>> CONFFILE=/etc/ganesha/ganesha.conf
>>>>>
>>>>> but it fails if the same is quoted
>>>>>
>>>>> CONFFILE="/etc/ganesha/ganesha.conf"
>>>>>
>>>>> It would be much better to use the following, which has a default as
>>>>> well:
>>>>>
>>>>> eval $(grep -F CONFFILE= /etc/sysconfig/ganesha)
>>>>> CONF=${CONFFILE:/etc/ganesha/ganesha.conf}
>>>>>
>>>>> I'll update the bug report.
>>>>> Having said this... the last issue to tackle is the real problem with
>>>>> the ganesha.nfsd :-(
>>>>
>>>> Thanks. Could you try changing log level to NIV_FULL_DEBUG in
>>>> '/etc/sysconfig/ganesha' and check if anything gets logged in
>>>> '/var/log/ganesha.log' or '/ganesha.log'.
>>>>
>>>> Thanks,
>>>> Soumya
>>>>
>>>>> Cheers,
>>>>>
>>>>> Alessandro
>>>>>
>>>>> On Tue, 2015-06-09 at 14:25 +0200, Alessandro De Salvo wrote:
>>>>>> OK, I can confirm that the ganesha.nsfd process is actually not
>>>>>> answering to the calls. Here it is what I see:
>>>>>>
>>>>>> # rpcinfo -p
>>>>>>    program vers proto   port  service
>>>>>>     100000    4   tcp    111  portmapper
>>>>>>     100000    3   tcp    111  portmapper
>>>>>>     100000    2   tcp    111  portmapper
>>>>>>     100000    4   udp    111  portmapper
>>>>>>     100000    3   udp    111  portmapper
>>>>>>     100000    2   udp    111  portmapper
>>>>>>     100024    1   udp  41594  status
>>>>>>     100024    1   tcp  53631  status
>>>>>>     100003    3   udp   2049  nfs
>>>>>>     100003    3   tcp   2049  nfs
>>>>>>     100003    4   udp   2049  nfs
>>>>>>     100003    4   tcp   2049  nfs
>>>>>>     100005    1   udp  58127  mountd
>>>>>>     100005    1   tcp  56301  mountd
>>>>>>     100005    3   udp  58127  mountd
>>>>>>     100005    3   tcp  56301  mountd
>>>>>>     100021    4   udp  46203  nlockmgr
>>>>>>     100021    4   tcp  41798  nlockmgr
>>>>>>     100011    1   udp    875  rquotad
>>>>>>     100011    1   tcp    875  rquotad
>>>>>>     100011    2   udp    875  rquotad
>>>>>>     100011    2   tcp    875  rquotad
>>>>>>
>>>>>> # netstat -lpn | grep ganesha
>>>>>> tcp6      14      0 :::2049      :::*      LISTEN      11937/ganesha.nfsd
>>>>>> tcp6       0      0 :::41798     :::*      LISTEN      11937/ganesha.nfsd
>>>>>> tcp6       0      0 :::875       :::*      LISTEN      11937/ganesha.nfsd
>>>>>> tcp6      10      0 :::56301     :::*      LISTEN      11937/ganesha.nfsd
>>>>>> tcp6       0      0 :::564       :::*      LISTEN      11937/ganesha.nfsd
>>>>>> udp6       0      0 :::2049      :::*                  11937/ganesha.nfsd
>>>>>> udp6       0      0 :::46203     :::*                  11937/ganesha.nfsd
>>>>>> udp6       0      0 :::58127     :::*                  11937/ganesha.nfsd
>>>>>> udp6       0      0 :::875       :::*                  11937/ganesha.nfsd
>>>>>>
>>>>>> I'm attaching the strace of a showmount from a node to the other.
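[Editorial note: the "error 98 (Address already in use)" in the quoted ganesha log is plain EADDRINUSE at bind() time, which is why clearing the stale rpcbind registration helps. A minimal, generic sketch of that failure mode (not ganesha code):]

```python
import errno
import socket

# Generic sketch of the failure quoted above: a second bind() to a port
# that is still held fails with EADDRINUSE -- the same "Address already
# in use" that ganesha reports for its RQUOTA socket.
holder = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
holder.bind(("::", 0))            # kernel picks a free port; we now hold it
port = holder.getsockname()[1]

latecomer = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
caught = None
try:
    latecomer.bind(("::", port))  # port still held by `holder`
except OSError as exc:
    caught = exc.errno            # errno 98 (EADDRINUSE) on Linux
latecomer.close()
holder.close()
print(caught == errno.EADDRINUSE)
```

The `Rquota_Port = 4501` workaround suggested above simply moves ganesha's RQUOTA listener to a port nothing else is holding.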
>>>>>> This machinery was working with nfs-ganesha 2.1.0, so it must be
>>>>>> something introduced with 2.2.0.
>>>>>> Cheers,
>>>>>>
>>>>>> Alessandro
>>>>>>
>>>>>> On Tue, 2015-06-09 at 15:16 +0530, Soumya Koduri wrote:
>>>>>>>
>>>>>>> On 06/09/2015 02:48 PM, Alessandro De Salvo wrote:
>>>>>>>> Hi,
>>>>>>>> OK, the problem with the VIPs not starting is due to the ganesha_mon
>>>>>>>> heartbeat script looking for a pid file called
>>>>>>>> /var/run/ganesha.nfsd.pid, while by default ganesha.nfsd v.2.2.0 is
>>>>>>>> creating /var/run/ganesha.pid, this needs to be corrected. The file is
>>>>>>>> in glusterfs-ganesha-3.7.1-1.el7.x86_64, in my case.
>>>>>>>> For the moment I have created a symlink in this way and it works:
>>>>>>>>
>>>>>>>> ln -s /var/run/ganesha.pid /var/run/ganesha.nfsd.pid
>>>>>>>>
>>>>>>> Thanks. Please update this as well in the bug.
>>>>>>>
>>>>>>>> So far so good, the VIPs are up and pingable, but still there is the
>>>>>>>> problem of the hanging showmount (i.e. hanging RPC).
>>>>>>>> Still, I see a lot of errors like this in /var/log/messages:
>>>>>>>>
>>>>>>>> Jun  9 11:15:20 atlas-node1 lrmd[31221]: notice: operation_finished:
>>>>>>>> nfs-mon_monitor_10000:29292:stderr [ Error: Resource does not exist. ]
>>>>>>>>
>>>>>>>> While ganesha.log shows the server is not in grace:
>>>>>>>>
>>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29964[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version /builddir/build/BUILD/nfs-ganesha-2.2.0/src, built at May 18 2015 14:17:18 on buildhw-09.phx2.fedoraproject.org
>>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
>>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
>>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
>>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] main :NFS STARTUP :WARN :No export entries found in configuration file !!!
>>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] config_errs_to_log :CONFIG :WARN :Config File ((null):0): Empty configuration file
>>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
>>>>>>>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT :currenty set capabilities are: cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep
>>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
>>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
>>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
>>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT :Callback creds directory (/var/run/ganesha) already exists
>>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (2:2)
>>>>>>>> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :9P/TCP dispatcher thread was started successfully
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[_9p_disp] _9p_dispatcher_thread :9P DISP :EVENT :9P dispatcher started
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
>>>>>>>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
>>>>>>>> 09/06/2015 11:17:22 : epoch 5576aee4 : atlas-node1 : ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
>>>>>>>>
>>>>>>> Please check the status of nfs-ganesha
>>>>>>> $ service nfs-ganesha status
>>>>>>>
>>>>>>> Could you try taking a packet trace (during showmount or mount) and
>>>>>>> check the server responses.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Soumya
>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Alessandro
>>>>>>>>
>>>>>>>>> Il giorno 09/giu/2015, alle ore 10:36, Alessandro De Salvo
>>>>>>>>> <alessandro.desalvo at roma1.infn.it> ha scritto:
>>>>>>>>>
>>>>>>>>> Hi Soumya,
>>>>>>>>>
>>>>>>>>>> Il giorno 09/giu/2015, alle ore 08:06, Soumya Koduri
>>>>>>>>>> <skoduri at redhat.com> ha scritto:
>>>>>>>>>>
>>>>>>>>>> On 06/09/2015 01:31 AM, Alessandro De Salvo wrote:
>>>>>>>>>>> OK, I found at least one of the bugs.
>>>>>>>>>>> The /usr/libexec/ganesha/ganesha.sh has the following lines:
>>>>>>>>>>>
>>>>>>>>>>> if [ -e /etc/os-release ]; then
>>>>>>>>>>>         RHEL6_PCS_CNAME_OPTION=""
>>>>>>>>>>> fi
>>>>>>>>>>>
>>>>>>>>>>> This is OK for RHEL < 7, but does not work for >= 7. I have changed
>>>>>>>>>>> it to the following, to make it working:
>>>>>>>>>>>
>>>>>>>>>>> if [ -e /etc/os-release ]; then
>>>>>>>>>>>         eval $(grep -F "REDHAT_SUPPORT_PRODUCT=" /etc/os-release)
>>>>>>>>>>>         [ "$REDHAT_SUPPORT_PRODUCT" == "Fedora" ] && RHEL6_PCS_CNAME_OPTION=""
>>>>>>>>>>> fi
>>>>>>>>>>>
>>>>>>>>>> Oh..Thanks for the fix. Could you please file a bug for the same (and
>>>>>>>>>> probably submit your fix as well). We shall have it corrected.
>>>>>>>>>
>>>>>>>>> Just did it, https://bugzilla.redhat.com/show_bug.cgi?id=1229601
>>>>>>>>>
>>>>>>>>>>> Apart from that, the VIP_<node> I was using were wrong, and I should
>>>>>>>>>>> have converted all the '-' to underscores, maybe this could be
>>>>>>>>>>> mentioned in the documentation when you will have it ready.
>>>>>>>>>>> Now, the cluster starts, but the VIPs apparently not:
>>>>>>>>>>>
>>>>>>>>>> Sure. Thanks again for pointing it out. We shall make a note of it.
>>>>>>>>>>
>>>>>>>>>>> Online: [ atlas-node1 atlas-node2 ]
>>>>>>>>>>>
>>>>>>>>>>> Full list of resources:
>>>>>>>>>>>
>>>>>>>>>>>  Clone Set: nfs-mon-clone [nfs-mon]
>>>>>>>>>>>      Started: [ atlas-node1 atlas-node2 ]
>>>>>>>>>>>  Clone Set: nfs-grace-clone [nfs-grace]
>>>>>>>>>>>      Started: [ atlas-node1 atlas-node2 ]
>>>>>>>>>>>  atlas-node1-cluster_ip-1   (ocf::heartbeat:IPaddr):  Stopped
>>>>>>>>>>>  atlas-node1-trigger_ip-1   (ocf::heartbeat:Dummy):   Started atlas-node1
>>>>>>>>>>>  atlas-node2-cluster_ip-1   (ocf::heartbeat:IPaddr):  Stopped
>>>>>>>>>>>  atlas-node2-trigger_ip-1   (ocf::heartbeat:Dummy):   Started atlas-node2
>>>>>>>>>>>  atlas-node1-dead_ip-1      (ocf::heartbeat:Dummy):   Started atlas-node1
>>>>>>>>>>>  atlas-node2-dead_ip-1      (ocf::heartbeat:Dummy):   Started atlas-node2
>>>>>>>>>>>
>>>>>>>>>>> PCSD Status:
>>>>>>>>>>>   atlas-node1: Online
>>>>>>>>>>>   atlas-node2: Online
>>>>>>>>>>>
>>>>>>>>>>> Daemon Status:
>>>>>>>>>>>   corosync: active/disabled
>>>>>>>>>>>   pacemaker: active/disabled
>>>>>>>>>>>   pcsd: active/enabled
>>>>>>>>>>>
>>>>>>>>>> Here corosync and pacemaker shows 'disabled' state. Can you check the
>>>>>>>>>> status of their services. They should be running prior to cluster
>>>>>>>>>> creation. We need to include that step in document as well.
>>>>>>>>>
>>>>>>>>> Ah, OK, you're right, I have added it to my puppet modules (we install
>>>>>>>>> and configure ganesha via puppet, I'll put the module on puppetforge
>>>>>>>>> soon, in case anyone is interested).
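[Editorial note on the VIP_<node> keys discussed in this thread: shell variable names cannot contain '-', so a node name such as atlas-node1 has to be flattened to underscores in ganesha-ha.conf. A generic sketch of the conversion (the exact script logic is an assumption, not taken from ganesha-ha.sh):]

```python
# Generic sketch: derive the ganesha-ha.conf key for a node's virtual IP.
# Shell identifiers may not contain "-", so dashes in the hostname must
# become underscores -- the mistake described in the thread.
def vip_key(node_name: str) -> str:
    return "VIP_" + node_name.replace("-", "_")

print(vip_key("atlas-node1"))  # VIP_atlas_node1
```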
>>>>>>>>>>> But the issue that is puzzling me more is the following:
>>>>>>>>>>>
>>>>>>>>>>> # showmount -e localhost
>>>>>>>>>>> rpc mount export: RPC: Timed out
>>>>>>>>>>>
>>>>>>>>>>> And when I try to enable the ganesha exports on a volume I get this
>>>>>>>>>>> error:
>>>>>>>>>>>
>>>>>>>>>>> # gluster volume set atlas-home-01 ganesha.enable on
>>>>>>>>>>> volume set: failed: Failed to create NFS-Ganesha export config file.
>>>>>>>>>>>
>>>>>>>>>>> But I see the file created in /etc/ganesha/exports/*.conf
>>>>>>>>>>> Still, showmount hangs and times out.
>>>>>>>>>>> Any help?
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>> Hmm that's strange. Sometimes, in case if there was no proper cleanup
>>>>>>>>>> done while trying to re-create the cluster, we have seen such issues.
>>>>>>>>>>
>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1227709
>>>>>>>>>>
>>>>>>>>>> http://review.gluster.org/#/c/11093/
>>>>>>>>>>
>>>>>>>>>> Can you please unexport all the volumes, teardown the cluster using
>>>>>>>>>> 'gluster vol set <volname> ganesha.enable off'
>>>>>>>>>
>>>>>>>>> OK:
>>>>>>>>>
>>>>>>>>> # gluster vol set atlas-home-01 ganesha.enable off
>>>>>>>>> volume set: failed: ganesha.enable is already 'off'.
>>>>>>>>>
>>>>>>>>> # gluster vol set atlas-data-01 ganesha.enable off
>>>>>>>>> volume set: failed: ganesha.enable is already 'off'.
>>>>>>>>>
>>>>>>>>>> 'gluster ganesha disable' command.
>>>>>>>>>
>>>>>>>>> I'm assuming you wanted to write nfs-ganesha instead?
>>>>>>>>>
>>>>>>>>> # gluster nfs-ganesha disable
>>>>>>>>> ganesha enable : success
>>>>>>>>>
>>>>>>>>> A side note (not really important): it's strange that when I do a
>>>>>>>>> disable the message is 'ganesha enable' :-)
>>>>>>>>>
>>>>>>>>>> Verify if the following files have been deleted on all the nodes-
>>>>>>>>>> '/etc/cluster/cluster.conf'
>>>>>>>>>
>>>>>>>>> this file is not present at all, I think it's not needed in CentOS 7
>>>>>>>>>
>>>>>>>>>> '/etc/ganesha/ganesha.conf',
>>>>>>>>>
>>>>>>>>> it's still there, but empty, and I guess it should be OK, right?
>>>>>>>>>
>>>>>>>>>> '/etc/ganesha/exports/*'
>>>>>>>>>
>>>>>>>>> no more files there
>>>>>>>>>
>>>>>>>>>> '/var/lib/pacemaker/cib'
>>>>>>>>>
>>>>>>>>> it's empty
>>>>>>>>>
>>>>>>>>>> Verify if the ganesha service is stopped on all the nodes.
>>>>>>>>>
>>>>>>>>> nope, it's still running, I will stop it.
>>>>>>>>>
>>>>>>>>>> start/restart the services - corosync, pcs.
>>>>>>>>>
>>>>>>>>> In the node where I issued the nfs-ganesha disable there is no more
>>>>>>>>> any /etc/corosync/corosync.conf so corosync won't start. The other
>>>>>>>>> node instead still has the file, it's strange.
>>>>>>>>>
>>>>>>>>>> And re-try the HA cluster creation
>>>>>>>>>> 'gluster ganesha enable'
>>>>>>>>>
>>>>>>>>> This time (repeated twice) it did not work at all:
>>>>>>>>>
>>>>>>>>> # pcs status
>>>>>>>>> Cluster name: ATLAS_GANESHA_01
>>>>>>>>> Last updated: Tue Jun  9 10:13:43 2015
>>>>>>>>> Last change: Tue Jun  9 10:13:22 2015
>>>>>>>>> Stack: corosync
>>>>>>>>> Current DC: atlas-node1 (1) - partition with quorum
>>>>>>>>> Version: 1.1.12-a14efad
>>>>>>>>> 2 Nodes configured
>>>>>>>>> 6 Resources configured
>>>>>>>>>
>>>>>>>>> Online: [ atlas-node1 atlas-node2 ]
>>>>>>>>>
>>>>>>>>> Full list of resources:
>>>>>>>>>
>>>>>>>>>  Clone Set: nfs-mon-clone [nfs-mon]
>>>>>>>>>      Started: [ atlas-node1 atlas-node2 ]
>>>>>>>>>  Clone Set: nfs-grace-clone [nfs-grace]
>>>>>>>>>      Started: [ atlas-node1 atlas-node2 ]
>>>>>>>>>  atlas-node2-dead_ip-1   (ocf::heartbeat:Dummy):   Started atlas-node1
>>>>>>>>>  atlas-node1-dead_ip-1   (ocf::heartbeat:Dummy):   Started atlas-node2
>>>>>>>>>
>>>>>>>>> PCSD Status:
>>>>>>>>>   atlas-node1: Online
>>>>>>>>>   atlas-node2: Online
>>>>>>>>>
>>>>>>>>> Daemon Status:
>>>>>>>>>   corosync: active/enabled
>>>>>>>>>   pacemaker: active/enabled
>>>>>>>>>   pcsd: active/enabled
>>>>>>>>>
>>>>>>>>> I tried then "pcs cluster destroy" on both nodes, and then again
>>>>>>>>> nfs-ganesha enable, but now I'm back to the old problem:
>>>>>>>>>
>>>>>>>>> # pcs status
>>>>>>>>> Cluster name: ATLAS_GANESHA_01
>>>>>>>>> Last updated: Tue Jun  9 10:22:27 2015
>>>>>>>>> Last change: Tue Jun  9 10:17:00 2015
>>>>>>>>> Stack: corosync
>>>>>>>>> Current DC: atlas-node2 (2) - partition with quorum
>>>>>>>>> Version: 1.1.12-a14efad
>>>>>>>>> 2 Nodes configured
>>>>>>>>> 10 Resources configured
>>>>>>>>>
>>>>>>>>> Online: [ atlas-node1 atlas-node2 ]
>>>>>>>>>
>>>>>>>>> Full list of resources:
>>>>>>>>>
>>>>>>>>>  Clone Set: nfs-mon-clone [nfs-mon]
>>>>>>>>>      Started: [ atlas-node1 atlas-node2 ]
>>>>>>>>>  Clone Set: nfs-grace-clone [nfs-grace]
>>>>>>>>>      Started: [ atlas-node1 atlas-node2 ]
>>>>>>>>>  atlas-node1-cluster_ip-1   (ocf::heartbeat:IPaddr):  Stopped
>>>>>>>>>  atlas-node1-trigger_ip-1   (ocf::heartbeat:Dummy):   Started atlas-node1
>>>>>>>>>  atlas-node2-cluster_ip-1   (ocf::heartbeat:IPaddr):  Stopped
>>>>>>>>>  atlas-node2-trigger_ip-1   (ocf::heartbeat:Dummy):   Started atlas-node2
>>>>>>>>>  atlas-node1-dead_ip-1      (ocf::heartbeat:Dummy):   Started atlas-node1
>>>>>>>>>  atlas-node2-dead_ip-1      (ocf::heartbeat:Dummy):   Started atlas-node2
>>>>>>>>>
>>>>>>>>> PCSD Status:
>>>>>>>>>   atlas-node1: Online
>>>>>>>>>   atlas-node2: Online
>>>>>>>>>
>>>>>>>>> Daemon Status:
>>>>>>>>>   corosync: active/enabled
>>>>>>>>>   pacemaker: active/enabled
>>>>>>>>>   pcsd: active/enabled
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Alessandro
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Soumya
>>>>>>>>>>
>>>>>>>>>>> Alessandro
>>>>>>>>>>>
>>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 20:00, Alessandro De Salvo
>>>>>>>>>>>> <Alessandro.DeSalvo at roma1.infn.it> ha scritto:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> indeed, it does not work :-)
>>>>>>>>>>>> OK, this is what I did, with 2 machines, running CentOS 7.1,
>>>>>>>>>>>> Glusterfs 3.7.1 and nfs-ganesha 2.2.0:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) ensured that the machines are able to resolve their IPs (but
>>>>>>>>>>>> this was already true since they were in the DNS);
>>>>>>>>>>>> 2) disabled NetworkManager and enabled network on both machines;
>>>>>>>>>>>> 3) created a gluster shared volume 'gluster_shared_storage' and
>>>>>>>>>>>> mounted it on '/run/gluster/shared_storage' on all the cluster
>>>>>>>>>>>> nodes using glusterfs native mount (on CentOS 7.1 there is a link
>>>>>>>>>>>> by default /var/run -> ../run)
>>>>>>>>>>>> 4) created an empty /etc/ganesha/ganesha.conf;
>>>>>>>>>>>> 5) installed pacemaker pcs resource-agents corosync on all cluster
>>>>>>>>>>>> machines;
>>>>>>>>>>>> 6) set the 'hacluster' user the same password on all machines;
>>>>>>>>>>>> 7) pcs cluster auth <hostname> -u hacluster -p <pass> on all the
>>>>>>>>>>>> nodes (on both nodes I issued the commands for both nodes)
>>>>>>>>>>>> 8) IPv6 is configured by default on all nodes, although the
>>>>>>>>>>>> infrastructure is not ready for IPv6
>>>>>>>>>>>> 9) enabled pcsd and started it on all nodes
>>>>>>>>>>>> 10) populated /etc/ganesha/ganesha-ha.conf with the following
>>>>>>>>>>>> contents, one per machine:
>>>>>>>>>>>>
>>>>>>>>>>>> ===> atlas-node1
>>>>>>>>>>>> # Name of the HA cluster created.
>>>>>>>>>>>> HA_NAME="ATLAS_GANESHA_01"
>>>>>>>>>>>> # The server from which you intend to mount
>>>>>>>>>>>> # the shared volume.
>>>>>>>>>>>> HA_VOL_SERVER="atlas-node1"
>>>>>>>>>>>> # The subset of nodes of the Gluster Trusted Pool
>>>>>>>>>>>> # that forms the ganesha HA cluster. IP/Hostname
>>>>>>>>>>>> # is specified.
>>>>>>>>>>>> HA_CLUSTER_NODES="atlas-node1,atlas-node2"
>>>>>>>>>>>> # Virtual IPs of each of the nodes specified above.
>>>>>>>>>>>> VIP_atlas-node1="x.x.x.1"
>>>>>>>>>>>> VIP_atlas-node2="x.x.x.2"
>>>>>>>>>>>>
>>>>>>>>>>>> ===> atlas-node2
>>>>>>>>>>>> # Name of the HA cluster created.
>>>>>>>>>>>> HA_NAME="ATLAS_GANESHA_01"
>>>>>>>>>>>> # The server from which you intend to mount
>>>>>>>>>>>> # the shared volume.
>>>>>>>>>>>> HA_VOL_SERVER="atlas-node2"
>>>>>>>>>>>> # The subset of nodes of the Gluster Trusted Pool
>>>>>>>>>>>> # that forms the ganesha HA cluster. IP/Hostname
>>>>>>>>>>>> # is specified.
>>>>>>>>>>>> HA_CLUSTER_NODES="atlas-node1,atlas-node2"
>>>>>>>>>>>> # Virtual IPs of each of the nodes specified above.
>>>>>>>>>>>> VIP_atlas-node1="x.x.x.1"
>>>>>>>>>>>> VIP_atlas-node2="x.x.x.2"
>>>>>>>>>>>>
>>>>>>>>>>>> 11) issued gluster nfs-ganesha enable, but it fails with a cryptic
>>>>>>>>>>>> message:
>>>>>>>>>>>>
>>>>>>>>>>>> # gluster nfs-ganesha enable
>>>>>>>>>>>> Enabling NFS-Ganesha requires Gluster-NFS to be disabled across the
>>>>>>>>>>>> trusted pool. Do you still want to continue? (y/n) y
>>>>>>>>>>>> nfs-ganesha: failed: Failed to set up HA config for NFS-Ganesha.
>>>>>>>>>>>> Please check the log file for details
>>>>>>>>>>>>
>>>>>>>>>>>> Looking at the logs I found nothing really special but this:
>>>>>>>>>>>>
>>>>>>>>>>>> ==> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log <==
>>>>>>>>>>>> [2015-06-08 17:57:15.672844] I [MSGID: 106132]
>>>>>>>>>>>> [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs
>>>>>>>>>>>> already stopped
>>>>>>>>>>>> [2015-06-08 17:57:15.675395] I
>>>>>>>>>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host
>>>>>>>>>>>> found Hostname is atlas-node2
>>>>>>>>>>>> [2015-06-08 17:57:15.720692] I
>>>>>>>>>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host
>>>>>>>>>>>> found Hostname is atlas-node2
>>>>>>>>>>>> [2015-06-08 17:57:15.721161] I
>>>>>>>>>>>> [glusterd-ganesha.c:335:is_ganesha_host] 0-management: ganesha host
>>>>>>>>>>>> found Hostname is atlas-node2
>>>>>>>>>>>> [2015-06-08 17:57:16.633048] E
>>>>>>>>>>>> [glusterd-ganesha.c:254:glusterd_op_set_ganesha] 0-management:
>>>>>>>>>>>> Initial NFS-Ganesha set up failed
>>>>>>>>>>>> [2015-06-08 17:57:16.641563] E
>>>>>>>>>>>> [glusterd-syncop.c:1396:gd_commit_op_phase] 0-management: Commit of
>>>>>>>>>>>> operation 'Volume (null)' failed on localhost : Failed to set up HA
>>>>>>>>>>>> config for NFS-Ganesha. Please check the log file for details
>>>>>>>>>>>>
>>>>>>>>>>>> ==> /var/log/glusterfs/cmd_history.log <==
>>>>>>>>>>>> [2015-06-08 17:57:16.643615]  : nfs-ganesha enable : FAILED :
>>>>>>>>>>>> Failed to set up HA config for NFS-Ganesha. Please check the log
>>>>>>>>>>>> file for details
>>>>>>>>>>>>
>>>>>>>>>>>> ==> /var/log/glusterfs/cli.log <==
>>>>>>>>>>>> [2015-06-08 17:57:16.643839] I [input.c:36:cli_batch] 0-: Exiting
>>>>>>>>>>>> with: -1
>>>>>>>>>>>>
>>>>>>>>>>>> Also, pcs seems to be fine for the auth part, although it obviously
>>>>>>>>>>>> tells me the cluster is not running.
>>>>>>>>>>>>
>>>>>>>>>>>> I, [2015-06-08T19:57:16.305323 #7223]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
>>>>>>>>>>>> I, [2015-06-08T19:57:16.345457 #7223]  INFO -- : Running: /usr/sbin/pcs cluster token-nodes
>>>>>>>>>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET /remote/check_auth HTTP/1.1" 200 68 0.1919
>>>>>>>>>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET /remote/check_auth HTTP/1.1" 200 68 0.1920
>>>>>>>>>>>> atlas-node1.mydomain - - [08/Jun/2015:19:57:16 CEST] "GET /remote/check_auth HTTP/1.1" 200 68
>>>>>>>>>>>> - -> /remote/check_auth
>>>>>>>>>>>>
>>>>>>>>>>>> What am I doing wrong?
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Alessandro
>>>>>>>>>>>>
>>>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 19:30, Soumya Koduri
>>>>>>>>>>>>> <skoduri at redhat.com> ha scritto:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 06/08/2015 08:20 PM, Alessandro De Salvo wrote:
>>>>>>>>>>>>>> Sorry, just another question:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - in my installation of gluster 3.7.1 the command gluster
>>>>>>>>>>>>>> features.ganesha enable does not work:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # gluster features.ganesha enable
>>>>>>>>>>>>>> unrecognized word: features.ganesha (position 0)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which version has full support for it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry. This option has recently been changed. It is now
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ gluster nfs-ganesha enable
>>>>>>>>>>>>>
>>>>>>>>>>>>>> - in the documentation the ccs and cman packages are required,
>>>>>>>>>>>>>> but they seems not to be available anymore on CentOS 7 and
>>>>>>>>>>>>>> similar, I guess they are not really required anymore, as pcs
>>>>>>>>>>>>>> should do the full job
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alessandro
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like so from http://clusterlabs.org/quickstart-redhat.html.
>>>>>>>>>>>>> Let us know if it doesn't work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Soumya
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 15:09, Alessandro De Salvo
>>>>>>>>>>>>>>> <alessandro.desalvo at roma1.infn.it> ha scritto:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Great, many thanks Soumya!
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Alessandro
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Il giorno 08/giu/2015, alle ore 13:53, Soumya Koduri
>>>>>>>>>>>>>>>> <skoduri at redhat.com> ha scritto:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please find the slides of the demo video at [1]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We recommend to have a distributed replica volume as a shared
>>>>>>>>>>>>>>>> volume for better data-availability.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Size of the volume depends on the workload you may have. Since
>>>>>>>>>>>>>>>> it is used to maintain states of NLM/NFSv4 clients, you may
>>>>>>>>>>>>>>>> calculate the size of the volume to be minimum of aggregate of
>>>>>>>>>>>>>>>> (typical_size_of'/var/lib/nfs'_directory +
>>>>>>>>>>>>>>>> ~4k*no_of_clients_connected_to_each_of_the_nfs_servers_at_any_point)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We shall document about this feature sooner in the gluster docs
>>>>>>>>>>>>>>>> as well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Soumya
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] - http://www.slideshare.net/SoumyaKoduri/high-49117846
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 06/08/2015 04:34 PM, Alessandro De Salvo wrote:
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> I have seen the demo video on ganesha HA,
>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=Z4mvTQC-efM
>>>>>>>>>>>>>>>>> However there is no advice on the appropriate size of the
>>>>>>>>>>>>>>>>> shared volume. How is it really used, and what should be a
>>>>>>>>>>>>>>>>> reasonable size for it?
>>>>>>>>>>>>>>>>> Also, are the slides from the video available somewhere, as
>>>>>>>>>>>>>>>>> well as a documentation on all this? I did not manage to find
>>>>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Alessandro
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
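[Editorial note: the sizing rule of thumb quoted in this thread can be turned into a quick back-of-the-envelope calculation. All input numbers below are illustrative assumptions, not recommendations:]

```python
# Back-of-the-envelope sizing for the shared volume, per the rule of
# thumb quoted in the thread. All inputs are illustrative assumptions.
var_lib_nfs_bytes = 10 * 1024 * 1024   # assumed typical /var/lib/nfs size
per_client_bytes = 4 * 1024            # ~4 KiB of state per connected client
clients_per_server = 500               # assumed peak clients per NFS server
nfs_servers = 2                        # nodes in the HA cluster

shared_volume_bytes = nfs_servers * (
    var_lib_nfs_bytes + per_client_bytes * clients_per_server
)
print(shared_volume_bytes)  # 25067520 (~24 MiB)
```

Even with generous assumptions the state data stays small; the volume is kept replicated for availability rather than capacity.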