Soumya Koduri
2015-Jun-09 09:46 UTC
[Gluster-users] Questions on ganesha HA and shared storage size
On 06/09/2015 02:48 PM, Alessandro De Salvo wrote:> Hi, > OK, the problem with the VIPs not starting is due to the ganesha_mon > heartbeat script looking for a pid file called > /var/run/ganesha.nfsd.pid, while by default ganesha.nfsd v.2.2.0 is > creating /var/run/ganesha.pid, this needs to be corrected. The file is > in glusterfs-ganesha-3.7.1-1.el7.x86_64, in my case. > For the moment I have created a symlink in this way and it works: > > ln -s /var/run/ganesha.pid /var/run/ganesha.nfsd.pid >Thanks. Please update this as well in the bug.> So far so good, the VIPs are up and pingable, but still there is the > problem of the hanging showmount (i.e. hanging RPC). > Still, I see a lot of errors like this in /var/log/messages: > > Jun 9 11:15:20 atlas-node1 lrmd[31221]: notice: operation_finished: > nfs-mon_monitor_10000:29292:stderr [ Error: Resource does not exist. ] > > While ganesha.log shows the server is not in grace: > > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29964[main] main :MAIN :EVENT :ganesha.nfsd Starting: > Ganesha Version /builddir/build/BUILD/nfs-ganesha-2.2.0/src, built at > May 18 2015 14:17:18 on buildhw-09.phx2.fedoraproject.org > <http://buildhw-09.phx2.fedoraproject.org> > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_set_param_from_conf :NFS STARTUP :EVENT > :Configuration file successfully parsed > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT > :Initializing ID Mapper. > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper > successfully initialized. > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] main :NFS STARTUP :WARN :No export entries > found in configuration file !!! 
> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] config_errs_to_log :CONFIG :WARN :Config File > ((null):0): Empty configuration file > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT > :CAP_SYS_RESOURCE was successfully removed for proper quota management > in FSAL > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT :currenty set > capabilities are: > cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_Init_svc :DISP :CRIT :Cannot acquire > credentials for principal nfs > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin > thread initialized > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs4_start_grace :STATE :EVENT :NFS Server Now > IN GRACE, duration 60 > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT > :Callback creds directory (/var/run/ganesha) already exists > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN > :gssd_refresh_krb5_machine_credential failed (2:2) > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :Starting > delayed executor. 
> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :9P/TCP > dispatcher thread was started successfully > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[_9p_disp] _9p_dispatcher_thread :9P DISP :EVENT :9P > dispatcher started > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT > :gsh_dbusthread was started successfully > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :admin thread > was started successfully > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :reaper thread > was started successfully > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN > GRACE > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :General > fridge was started successfully > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT > :------------------------------------------------- > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT : NFS > SERVER INITIALIZED > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT > :------------------------------------------------- > 09/06/2015 11:17:22 : epoch 5576aee4 : atlas-node1 : > ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now > NOT IN GRACE > >Please check the status of nfs-ganesha $service nfs-ganesha status Could you try taking a packet trace (during showmount or mount) and check the server responses. Thanks, Soumya> Cheers, > > Alessandro > > >> Il giorno 09/giu/2015, alle ore 10:36, Alessandro De Salvo >> <alessandro.desalvo at roma1.infn.it >> <mailto:alessandro.desalvo at roma1.infn.it>> ha scritto: >> >> Hi Soumya, >> >>> Il giorno 09/giu/2015, alle ore 08:06, Soumya Koduri >>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: >>> >>> >>> >>> On 06/09/2015 01:31 AM, Alessandro De Salvo wrote: >>>> OK, I found at least one of the bugs. >>>> The /usr/libexec/ganesha/ganesha.sh has the following lines: >>>> >>>> if [ -e /etc/os-release ]; then >>>> RHEL6_PCS_CNAME_OPTION="" >>>> fi >>>> >>>> This is OK for RHEL < 7, but does not work for >= 7. I have changed >>>> it to the following, to make it working: >>>> >>>> if [ -e /etc/os-release ]; then >>>> eval $(grep -F "REDHAT_SUPPORT_PRODUCT=" /etc/os-release) >>>> [ "$REDHAT_SUPPORT_PRODUCT" == "Fedora" ] && >>>> RHEL6_PCS_CNAME_OPTION="" >>>> fi >>>> >>> Oh..Thanks for the fix. Could you please file a bug for the same (and >>> probably submit your fix as well). We shall have it corrected. >> >> Just did it,https://bugzilla.redhat.com/show_bug.cgi?id=1229601 >> >>> >>>> Apart from that, the VIP_<node> I was using were wrong, and I should >>>> have converted all the ?-? to underscores, maybe this could be >>>> mentioned in the documentation when you will have it ready. >>>> Now, the cluster starts, but the VIPs apparently not: >>>> >>> Sure. Thanks again for pointing it out. We shall make a note of it. 
>>> >>>> Online: [ atlas-node1 atlas-node2 ] >>>> >>>> Full list of resources: >>>> >>>> Clone Set: nfs-mon-clone [nfs-mon] >>>> Started: [ atlas-node1 atlas-node2 ] >>>> Clone Set: nfs-grace-clone [nfs-grace] >>>> Started: [ atlas-node1 atlas-node2 ] >>>> atlas-node1-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >>>> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>>> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >>>> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>>> >>>> PCSD Status: >>>> atlas-node1: Online >>>> atlas-node2: Online >>>> >>>> Daemon Status: >>>> corosync: active/disabled >>>> pacemaker: active/disabled >>>> pcsd: active/enabled >>>> >>>> >>> Here corosync and pacemaker shows 'disabled' state. Can you check the >>> status of their services. They should be running prior to cluster >>> creation. We need to include that step in document as well. >> >> Ah, OK, you?re right, I have added it to my puppet modules (we install >> and configure ganesha via puppet, I?ll put the module on puppetforge >> soon, in case anyone is interested). >> >>> >>>> But the issue that is puzzling me more is the following: >>>> >>>> # showmount -e localhost >>>> rpc mount export: RPC: Timed out >>>> >>>> And when I try to enable the ganesha exports on a volume I get this >>>> error: >>>> >>>> # gluster volume set atlas-home-01 ganesha.enable on >>>> volume set: failed: Failed to create NFS-Ganesha export config file. >>>> >>>> But I see the file created in /etc/ganesha/exports/*.conf >>>> Still, showmount hangs and times out. >>>> Any help? >>>> Thanks, >>>> >>> Hmm that's strange. Sometimes, in case if there was no proper cleanup >>> done while trying to re-create the cluster, we have seen such issues. >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=1227709 >>> >>> http://review.gluster.org/#/c/11093/ >>> >>> Can you please unexport all the volumes, teardown the cluster using >>> 'gluster vol set <volname> ganesha.enable off? >> >> OK: >> >> # gluster vol set atlas-home-01 ganesha.enable off >> volume set: failed: ganesha.enable is already 'off'. >> >> # gluster vol set atlas-data-01 ganesha.enable off >> volume set: failed: ganesha.enable is already 'off'. >> >> >>> 'gluster ganesha disable' command. >> >> I?m assuming you wanted to write nfs-ganesha instead? >> >> # gluster nfs-ganesha disable >> ganesha enable : success >> >> >> A side note (not really important): it?s strange that when I do a >> disable the message is ?ganesha enable? :-) >> >>> >>> Verify if the following files have been deleted on all the nodes- >>> '/etc/cluster/cluster.conf? >> >> this file is not present at all, I think it?s not needed in CentOS 7 >> >>> '/etc/ganesha/ganesha.conf?, >> >> it?s still there, but empty, and I guess it should be OK, right? >> >>> '/etc/ganesha/exports/*? >> >> no more files there >> >>> '/var/lib/pacemaker/cib? >> >> it?s empty >> >>> >>> Verify if the ganesha service is stopped on all the nodes. >> >> nope, it?s still running, I will stop it. >> >>> >>> start/restart the services - corosync, pcs. >> >> In the node where I issued the nfs-ganesha disable there is no more >> any /etc/corosync/corosync.conf so corosync won?t start. The other >> node instead still has the file, it?s strange. >> >>> >>> And re-try the HA cluster creation >>> 'gluster ganesha enable? 
>> >> This time (repeated twice) it did not work at all: >> >> # pcs status >> Cluster name: ATLAS_GANESHA_01 >> Last updated: Tue Jun 9 10:13:43 2015 >> Last change: Tue Jun 9 10:13:22 2015 >> Stack: corosync >> Current DC: atlas-node1 (1) - partition with quorum >> Version: 1.1.12-a14efad >> 2 Nodes configured >> 6 Resources configured >> >> >> Online: [ atlas-node1 atlas-node2 ] >> >> Full list of resources: >> >> Clone Set: nfs-mon-clone [nfs-mon] >> Started: [ atlas-node1 atlas-node2 ] >> Clone Set: nfs-grace-clone [nfs-grace] >> Started: [ atlas-node1 atlas-node2 ] >> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >> >> PCSD Status: >> atlas-node1: Online >> atlas-node2: Online >> >> Daemon Status: >> corosync: active/enabled >> pacemaker: active/enabled >> pcsd: active/enabled >> >> >> >> I tried then "pcs cluster destroy" on both nodes, and then again >> nfs-ganesha enable, but now I?m back to the old problem: >> >> # pcs status >> Cluster name: ATLAS_GANESHA_01 >> Last updated: Tue Jun 9 10:22:27 2015 >> Last change: Tue Jun 9 10:17:00 2015 >> Stack: corosync >> Current DC: atlas-node2 (2) - partition with quorum >> Version: 1.1.12-a14efad >> 2 Nodes configured >> 10 Resources configured >> >> >> Online: [ atlas-node1 atlas-node2 ] >> >> Full list of resources: >> >> Clone Set: nfs-mon-clone [nfs-mon] >> Started: [ atlas-node1 atlas-node2 ] >> Clone Set: nfs-grace-clone [nfs-grace] >> Started: [ atlas-node1 atlas-node2 ] >> atlas-node1-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >> >> PCSD Status: >> atlas-node1: Online >> atlas-node2: Online >> >> Daemon Status: >> corosync: active/enabled >> pacemaker: active/enabled >> pcsd: active/enabled >> >> >> Cheers, >> >> Alessandro >> >>> >>> >>> Thanks, >>> Soumya >>> >>>> Alessandro >>>> >>>>> Il giorno 08/giu/2015, alle ore 20:00, Alessandro De Salvo >>>>> <Alessandro.DeSalvo at roma1.infn.it >>>>> <mailto:Alessandro.DeSalvo at roma1.infn.it>> ha scritto: >>>>> >>>>> Hi, >>>>> indeed, it does not work :-) >>>>> OK, this is what I did, with 2 machines, running CentOS 7.1, >>>>> Glusterfs 3.7.1 and nfs-ganesha 2.2.0: >>>>> >>>>> 1) ensured that the machines are able to resolve their IPs (but >>>>> this was already true since they were in the DNS); >>>>> 2) disabled NetworkManager and enabled network on both machines; >>>>> 3) created a gluster shared volume 'gluster_shared_storage' and >>>>> mounted it on '/run/gluster/shared_storage' on all the cluster >>>>> nodes using glusterfs native mount (on CentOS 7.1 there is a link >>>>> by default /var/run -> ../run) >>>>> 4) created an empty /etc/ganesha/ganesha.conf; >>>>> 5) installed pacemaker pcs resource-agents corosync on all cluster >>>>> machines; >>>>> 6) set the ?hacluster? 
user the same password on all machines; >>>>> 7) pcs cluster auth <hostname> -u hacluster -p <pass> on all the >>>>> nodes (on both nodes I issued the commands for both nodes) >>>>> 8) IPv6 is configured by default on all nodes, although the >>>>> infrastructure is not ready for IPv6 >>>>> 9) enabled pcsd and started it on all nodes >>>>> 10) populated /etc/ganesha/ganesha-ha.conf with the following >>>>> contents, one per machine: >>>>> >>>>> >>>>> ===> atlas-node1 >>>>> # Name of the HA cluster created. >>>>> HA_NAME="ATLAS_GANESHA_01" >>>>> # The server from which you intend to mount >>>>> # the shared volume. >>>>> HA_VOL_SERVER=?atlas-node1" >>>>> # The subset of nodes of the Gluster Trusted Pool >>>>> # that forms the ganesha HA cluster. IP/Hostname >>>>> # is specified. >>>>> HA_CLUSTER_NODES=?atlas-node1,atlas-node2" >>>>> # Virtual IPs of each of the nodes specified above. >>>>> VIP_atlas-node1=?x.x.x.1" >>>>> VIP_atlas-node2=?x.x.x.2" >>>>> >>>>> ===> atlas-node2 >>>>> # Name of the HA cluster created. >>>>> HA_NAME="ATLAS_GANESHA_01" >>>>> # The server from which you intend to mount >>>>> # the shared volume. >>>>> HA_VOL_SERVER=?atlas-node2" >>>>> # The subset of nodes of the Gluster Trusted Pool >>>>> # that forms the ganesha HA cluster. IP/Hostname >>>>> # is specified. >>>>> HA_CLUSTER_NODES=?atlas-node1,atlas-node2" >>>>> # Virtual IPs of each of the nodes specified above. >>>>> VIP_atlas-node1=?x.x.x.1" >>>>> VIP_atlas-node2=?x.x.x.2? >>>>> >>>>> 11) issued gluster nfs-ganesha enable, but it fails with a cryptic >>>>> message: >>>>> >>>>> # gluster nfs-ganesha enable >>>>> Enabling NFS-Ganesha requires Gluster-NFS to be disabled across the >>>>> trusted pool. Do you still want to continue? (y/n) y >>>>> nfs-ganesha: failed: Failed to set up HA config for NFS-Ganesha. >>>>> Please check the log file for details >>>>> >>>>> Looking at the logs I found nothing really special but this: >>>>> >>>>> ==> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log <=>>>>> [2015-06-08 17:57:15.672844] I [MSGID: 106132] >>>>> [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs >>>>> already stopped >>>>> [2015-06-08 17:57:15.675395] I >>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host >>>>> found Hostname is atlas-node2 >>>>> [2015-06-08 17:57:15.720692] I >>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host >>>>> found Hostname is atlas-node2 >>>>> [2015-06-08 17:57:15.721161] I >>>>> [glusterd-ganesha.c:335:is_ganesha_host] 0-management: ganesha host >>>>> found Hostname is atlas-node2 >>>>> [2015-06-08 17:57:16.633048] E >>>>> [glusterd-ganesha.c:254:glusterd_op_set_ganesha] 0-management: >>>>> Initial NFS-Ganesha set up failed >>>>> [2015-06-08 17:57:16.641563] E >>>>> [glusterd-syncop.c:1396:gd_commit_op_phase] 0-management: Commit of >>>>> operation 'Volume (null)' failed on localhost : Failed to set up HA >>>>> config for NFS-Ganesha. Please check the log file for details >>>>> >>>>> ==> /var/log/glusterfs/cmd_history.log <=>>>>> [2015-06-08 17:57:16.643615] : nfs-ganesha enable : FAILED : >>>>> Failed to set up HA config for NFS-Ganesha. Please check the log >>>>> file for details >>>>> >>>>> ==> /var/log/glusterfs/cli.log <=>>>>> [2015-06-08 17:57:16.643839] I [input.c:36:cli_batch] 0-: Exiting >>>>> with: -1 >>>>> >>>>> >>>>> Also, pcs seems to be fine for the auth part, although it obviously >>>>> tells me the cluster is not running. 
>>>>> >>>>> I, [2015-06-08T19:57:16.305323 #7223] INFO -- : Running: >>>>> /usr/sbin/corosync-cmapctl totem.cluster_name >>>>> I, [2015-06-08T19:57:16.345457 #7223] INFO -- : Running: >>>>> /usr/sbin/pcs cluster token-nodes >>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET >>>>> /remote/check_auth HTTP/1.1" 200 68 0.1919 >>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET >>>>> /remote/check_auth HTTP/1.1" 200 68 0.1920 >>>>> atlas-node1.mydomain - - [08/Jun/2015:19:57:16 CEST] "GET >>>>> /remote/check_auth HTTP/1.1" 200 68 >>>>> - -> /remote/check_auth >>>>> >>>>> >>>>> What am I doing wrong? >>>>> Thanks, >>>>> >>>>> Alessandro >>>>> >>>>>> Il giorno 08/giu/2015, alle ore 19:30, Soumya Koduri >>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 06/08/2015 08:20 PM, Alessandro De Salvo wrote: >>>>>>> Sorry, just another question: >>>>>>> >>>>>>> - in my installation of gluster 3.7.1 the command gluster >>>>>>> features.ganesha enable does not work: >>>>>>> >>>>>>> # gluster features.ganesha enable >>>>>>> unrecognized word: features.ganesha (position 0) >>>>>>> >>>>>>> Which version has full support for it? >>>>>> >>>>>> Sorry. This option has recently been changed. It is now >>>>>> >>>>>> $ gluster nfs-ganesha enable >>>>>> >>>>>> >>>>>>> >>>>>>> - in the documentation the ccs and cman packages are required, >>>>>>> but they seems not to be available anymore on CentOS 7 and >>>>>>> similar, I guess they are not really required anymore, as pcs >>>>>>> should do the full job >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Alessandro >>>>>> >>>>>> Looks like so from http://clusterlabs.org/quickstart-redhat.html. >>>>>> Let us know if it doesn't work. >>>>>> >>>>>> Thanks, >>>>>> Soumya >>>>>> >>>>>>> >>>>>>>> Il giorno 08/giu/2015, alle ore 15:09, Alessandro De Salvo >>>>>>>> <alessandro.desalvo at roma1.infn.it >>>>>>>> <mailto:alessandro.desalvo at roma1.infn.it>> ha scritto: >>>>>>>> >>>>>>>> Great, many thanks Soumya! >>>>>>>> Cheers, >>>>>>>> >>>>>>>> Alessandro >>>>>>>> >>>>>>>>> Il giorno 08/giu/2015, alle ore 13:53, Soumya Koduri >>>>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Please find the slides of the demo video at [1] >>>>>>>>> >>>>>>>>> We recommend to have a distributed replica volume as a shared >>>>>>>>> volume for better data-availability. >>>>>>>>> >>>>>>>>> Size of the volume depends on the workload you may have. Since >>>>>>>>> it is used to maintain states of NLM/NFSv4 clients, you may >>>>>>>>> calculate the size of the volume to be minimum of aggregate of >>>>>>>>> (typical_size_of'/var/lib/nfs'_directory + >>>>>>>>> ~4k*no_of_clients_connected_to_each_of_the_nfs_servers_at_any_point) >>>>>>>>> >>>>>>>>> We shall document about this feature sooner in the gluster docs >>>>>>>>> as well. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Soumya >>>>>>>>> >>>>>>>>> [1] - http://www.slideshare.net/SoumyaKoduri/high-49117846 >>>>>>>>> >>>>>>>>> On 06/08/2015 04:34 PM, Alessandro De Salvo wrote: >>>>>>>>>> Hi, >>>>>>>>>> I have seen the demo video on ganesha HA, >>>>>>>>>> https://www.youtube.com/watch?v=Z4mvTQC-efM >>>>>>>>>> However there is no advice on the appropriate size of the >>>>>>>>>> shared volume. How is it really used, and what should be a >>>>>>>>>> reasonable size for it? >>>>>>>>>> Also, are the slides from the video available somewhere, as >>>>>>>>>> well as a documentation on all this? I did not manage to find >>>>>>>>>> them. 
>>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Alessandro
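For reference, a cleaned-up per-node /etc/ganesha/ganesha-ha.conf for atlas-node1, with the hyphen-to-underscore rule mentioned above applied to the VIP_* key names (presumably needed because the keys end up as shell variable names), would look roughly like this sketch; the addresses are placeholders exactly as in the original:

# Name of the HA cluster created.
HA_NAME="ATLAS_GANESHA_01"
# The server from which you intend to mount the shared volume.
HA_VOL_SERVER="atlas-node1"
# The subset of nodes of the Gluster Trusted Pool that forms the ganesha HA cluster.
HA_CLUSTER_NODES="atlas-node1,atlas-node2"
# Virtual IPs of each of the nodes: note that "atlas-node1" becomes
# "atlas_node1" in the key name, while the hostnames above keep their hyphens.
VIP_atlas_node1="x.x.x.1"
VIP_atlas_node2="x.x.x.2"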
Alessandro De Salvo
2015-Jun-09 10:27 UTC
[Gluster-users] Questions on ganesha HA and shared storage size
Hi,> Il giorno 09/giu/2015, alle ore 11:46, Soumya Koduri <skoduri at redhat.com> ha scritto: > > > > On 06/09/2015 02:48 PM, Alessandro De Salvo wrote: >> Hi, >> OK, the problem with the VIPs not starting is due to the ganesha_mon >> heartbeat script looking for a pid file called >> /var/run/ganesha.nfsd.pid, while by default ganesha.nfsd v.2.2.0 is >> creating /var/run/ganesha.pid, this needs to be corrected. The file is >> in glusterfs-ganesha-3.7.1-1.el7.x86_64, in my case. >> For the moment I have created a symlink in this way and it works: >> >> ln -s /var/run/ganesha.pid /var/run/ganesha.nfsd.pid >> > Thanks. Please update this as well in the bug.Done :-)> >> So far so good, the VIPs are up and pingable, but still there is the >> problem of the hanging showmount (i.e. hanging RPC). >> Still, I see a lot of errors like this in /var/log/messages: >> >> Jun 9 11:15:20 atlas-node1 lrmd[31221]: notice: operation_finished: >> nfs-mon_monitor_10000:29292:stderr [ Error: Resource does not exist. ] >> >> While ganesha.log shows the server is not in grace: >> >> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29964[main] main :MAIN :EVENT :ganesha.nfsd Starting: >> Ganesha Version /builddir/build/BUILD/nfs-ganesha-2.2.0/src, built at >> May 18 2015 14:17:18 on buildhw-09.phx2.fedoraproject.org >> <http://buildhw-09.phx2.fedoraproject.org> >> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_set_param_from_conf :NFS STARTUP :EVENT >> :Configuration file successfully parsed >> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT >> :Initializing ID Mapper. >> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper >> successfully initialized. >> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] main :NFS STARTUP :WARN :No export entries >> found in configuration file !!! 
>> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] config_errs_to_log :CONFIG :WARN :Config File >> ((null):0): Empty configuration file >> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT >> :CAP_SYS_RESOURCE was successfully removed for proper quota management >> in FSAL >> 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT :currenty set >> capabilities are: >> cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep >> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_Init_svc :DISP :CRIT :Cannot acquire >> credentials for principal nfs >> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin >> thread initialized >> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs4_start_grace :STATE :EVENT :NFS Server Now >> IN GRACE, duration 60 >> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT >> :Callback creds directory (/var/run/ganesha) already exists >> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN >> :gssd_refresh_krb5_machine_credential failed (2:2) >> 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :Starting >> delayed executor. 
>> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :9P/TCP >> dispatcher thread was started successfully >> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[_9p_disp] _9p_dispatcher_thread :9P DISP :EVENT :9P >> dispatcher started >> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT >> :gsh_dbusthread was started successfully >> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :admin thread >> was started successfully >> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :reaper thread >> was started successfully >> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN >> GRACE >> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :General >> fridge was started successfully >> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT >> :------------------------------------------------- >> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT : NFS >> SERVER INITIALIZED >> 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT >> :------------------------------------------------- >> 09/06/2015 11:17:22 : epoch 5576aee4 : atlas-node1 : >> ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now >> NOT IN GRACE >> >> > Please check the status of nfs-ganesha > $service nfs-ganesha statusIt?s fine: # service nfs-ganesha status Redirecting to /bin/systemctl status nfs-ganesha.service nfs-ganesha.service - NFS-Ganesha file server Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha.service; enabled) Active: active (running) since Tue 2015-06-09 11:54:39 CEST; 32min ago Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki Process: 28081 ExecStop=/bin/dbus-send --system --dest=org.ganesha.nfsd --type=method_call /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.shutdown (code=exited, status=0/SUCCESS) Process: 28425 ExecStartPost=/bin/bash -c prlimit --pid $MAINPID --nofile=$NOFILE:$NOFILE (code=exited, status=0/SUCCESS) Process: 28423 ExecStart=/usr/bin/ganesha.nfsd $OPTIONS (code=exited, status=0/SUCCESS) Main PID: 28424 (ganesha.nfsd) CGroup: /system.slice/nfs-ganesha.service ??28424 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid> > Could you try taking a packet trace (during showmount or mount) and check the server responses.The problem is that the portmapper seems to be working but the nothing happens: 3785 0.652843 x.x.x.2 -> x.x.x.1 Portmap 98 V2 GETPORT Call MOUNT(100005) V:3 TCP 3788 0.653339 x.x.x.1 -> x.x.x.2 Portmap 70 V2 GETPORT Reply (Call In 3785) Port:33645 3789 0.653756 x.x.x.2 -> x.x.x.1 TCP 74 50774 > 33645 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=73312128 TSecr=0 WS=128 3790 0.653784 x.x.x.1 -> x.x.x.2 TCP 74 33645 > 50774 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=132248576 TSecr=73312128 WS=128 3791 0.654004 x.x.x.2 -> x.x.x.1 TCP 66 50774 > 33645 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=73312128 TSecr=132248576 3793 0.654174 x.x.x.2 -> x.x.x.1 MOUNT 158 V3 EXPORT Call 3794 0.654184 
x.x.x.1 -> x.x.x.2 TCP 66 33645 > 50774 [ACK] Seq=1 Ack=93 Win=14592 Len=0 TSval=132248576 TSecr=73312129 86065 20.674219 x.x.x.2 -> x.x.x.1 TCP 66 50774 > 33645 [FIN, ACK] Seq=93 Ack=1 Win=29312 Len=0 TSval=73332149 TSecr=132248576 86247 20.713745 x.x.x.1 -> x.x.x.2 TCP 66 33645 > 50774 [ACK] Seq=1 Ack=94 Win=14592 Len=0 TSval=132268636 TSecr=73332149 Cheers, Alessandro> > Thanks, > Soumya > >> Cheers, >> >> Alessandro >> >> >>> Il giorno 09/giu/2015, alle ore 10:36, Alessandro De Salvo >>> <alessandro.desalvo at roma1.infn.it >>> <mailto:alessandro.desalvo at roma1.infn.it>> ha scritto: >>> >>> Hi Soumya, >>> >>>> Il giorno 09/giu/2015, alle ore 08:06, Soumya Koduri >>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: >>>> >>>> >>>> >>>> On 06/09/2015 01:31 AM, Alessandro De Salvo wrote: >>>>> OK, I found at least one of the bugs. >>>>> The /usr/libexec/ganesha/ganesha.sh has the following lines: >>>>> >>>>> if [ -e /etc/os-release ]; then >>>>> RHEL6_PCS_CNAME_OPTION="" >>>>> fi >>>>> >>>>> This is OK for RHEL < 7, but does not work for >= 7. I have changed >>>>> it to the following, to make it working: >>>>> >>>>> if [ -e /etc/os-release ]; then >>>>> eval $(grep -F "REDHAT_SUPPORT_PRODUCT=" /etc/os-release) >>>>> [ "$REDHAT_SUPPORT_PRODUCT" == "Fedora" ] && >>>>> RHEL6_PCS_CNAME_OPTION="" >>>>> fi >>>>> >>>> Oh..Thanks for the fix. Could you please file a bug for the same (and >>>> probably submit your fix as well). We shall have it corrected. >>> >>> Just did it,https://bugzilla.redhat.com/show_bug.cgi?id=1229601 >>> >>>> >>>>> Apart from that, the VIP_<node> I was using were wrong, and I should >>>>> have converted all the ?-? to underscores, maybe this could be >>>>> mentioned in the documentation when you will have it ready. >>>>> Now, the cluster starts, but the VIPs apparently not: >>>>> >>>> Sure. Thanks again for pointing it out. We shall make a note of it. >>>> >>>>> Online: [ atlas-node1 atlas-node2 ] >>>>> >>>>> Full list of resources: >>>>> >>>>> Clone Set: nfs-mon-clone [nfs-mon] >>>>> Started: [ atlas-node1 atlas-node2 ] >>>>> Clone Set: nfs-grace-clone [nfs-grace] >>>>> Started: [ atlas-node1 atlas-node2 ] >>>>> atlas-node1-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >>>>> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>>>> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >>>>> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>>>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>>>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>>>> >>>>> PCSD Status: >>>>> atlas-node1: Online >>>>> atlas-node2: Online >>>>> >>>>> Daemon Status: >>>>> corosync: active/disabled >>>>> pacemaker: active/disabled >>>>> pcsd: active/enabled >>>>> >>>>> >>>> Here corosync and pacemaker shows 'disabled' state. Can you check the >>>> status of their services. They should be running prior to cluster >>>> creation. We need to include that step in document as well. >>> >>> Ah, OK, you?re right, I have added it to my puppet modules (we install >>> and configure ganesha via puppet, I?ll put the module on puppetforge >>> soon, in case anyone is interested). 
>>> >>>> >>>>> But the issue that is puzzling me more is the following: >>>>> >>>>> # showmount -e localhost >>>>> rpc mount export: RPC: Timed out >>>>> >>>>> And when I try to enable the ganesha exports on a volume I get this >>>>> error: >>>>> >>>>> # gluster volume set atlas-home-01 ganesha.enable on >>>>> volume set: failed: Failed to create NFS-Ganesha export config file. >>>>> >>>>> But I see the file created in /etc/ganesha/exports/*.conf >>>>> Still, showmount hangs and times out. >>>>> Any help? >>>>> Thanks, >>>>> >>>> Hmm that's strange. Sometimes, in case if there was no proper cleanup >>>> done while trying to re-create the cluster, we have seen such issues. >>>> >>>> https://bugzilla.redhat.com/show_bug.cgi?id=1227709 >>>> >>>> http://review.gluster.org/#/c/11093/ >>>> >>>> Can you please unexport all the volumes, teardown the cluster using >>>> 'gluster vol set <volname> ganesha.enable off? >>> >>> OK: >>> >>> # gluster vol set atlas-home-01 ganesha.enable off >>> volume set: failed: ganesha.enable is already 'off'. >>> >>> # gluster vol set atlas-data-01 ganesha.enable off >>> volume set: failed: ganesha.enable is already 'off'. >>> >>> >>>> 'gluster ganesha disable' command. >>> >>> I?m assuming you wanted to write nfs-ganesha instead? >>> >>> # gluster nfs-ganesha disable >>> ganesha enable : success >>> >>> >>> A side note (not really important): it?s strange that when I do a >>> disable the message is ?ganesha enable? :-) >>> >>>> >>>> Verify if the following files have been deleted on all the nodes- >>>> '/etc/cluster/cluster.conf? >>> >>> this file is not present at all, I think it?s not needed in CentOS 7 >>> >>>> '/etc/ganesha/ganesha.conf?, >>> >>> it?s still there, but empty, and I guess it should be OK, right? >>> >>>> '/etc/ganesha/exports/*? >>> >>> no more files there >>> >>>> '/var/lib/pacemaker/cib? >>> >>> it?s empty >>> >>>> >>>> Verify if the ganesha service is stopped on all the nodes. >>> >>> nope, it?s still running, I will stop it. >>> >>>> >>>> start/restart the services - corosync, pcs. >>> >>> In the node where I issued the nfs-ganesha disable there is no more >>> any /etc/corosync/corosync.conf so corosync won?t start. The other >>> node instead still has the file, it?s strange. >>> >>>> >>>> And re-try the HA cluster creation >>>> 'gluster ganesha enable? 
>>> >>> This time (repeated twice) it did not work at all: >>> >>> # pcs status >>> Cluster name: ATLAS_GANESHA_01 >>> Last updated: Tue Jun 9 10:13:43 2015 >>> Last change: Tue Jun 9 10:13:22 2015 >>> Stack: corosync >>> Current DC: atlas-node1 (1) - partition with quorum >>> Version: 1.1.12-a14efad >>> 2 Nodes configured >>> 6 Resources configured >>> >>> >>> Online: [ atlas-node1 atlas-node2 ] >>> >>> Full list of resources: >>> >>> Clone Set: nfs-mon-clone [nfs-mon] >>> Started: [ atlas-node1 atlas-node2 ] >>> Clone Set: nfs-grace-clone [nfs-grace] >>> Started: [ atlas-node1 atlas-node2 ] >>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>> >>> PCSD Status: >>> atlas-node1: Online >>> atlas-node2: Online >>> >>> Daemon Status: >>> corosync: active/enabled >>> pacemaker: active/enabled >>> pcsd: active/enabled >>> >>> >>> >>> I tried then "pcs cluster destroy" on both nodes, and then again >>> nfs-ganesha enable, but now I?m back to the old problem: >>> >>> # pcs status >>> Cluster name: ATLAS_GANESHA_01 >>> Last updated: Tue Jun 9 10:22:27 2015 >>> Last change: Tue Jun 9 10:17:00 2015 >>> Stack: corosync >>> Current DC: atlas-node2 (2) - partition with quorum >>> Version: 1.1.12-a14efad >>> 2 Nodes configured >>> 10 Resources configured >>> >>> >>> Online: [ atlas-node1 atlas-node2 ] >>> >>> Full list of resources: >>> >>> Clone Set: nfs-mon-clone [nfs-mon] >>> Started: [ atlas-node1 atlas-node2 ] >>> Clone Set: nfs-grace-clone [nfs-grace] >>> Started: [ atlas-node1 atlas-node2 ] >>> atlas-node1-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >>> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped >>> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 >>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 >>> >>> PCSD Status: >>> atlas-node1: Online >>> atlas-node2: Online >>> >>> Daemon Status: >>> corosync: active/enabled >>> pacemaker: active/enabled >>> pcsd: active/enabled >>> >>> >>> Cheers, >>> >>> Alessandro >>> >>>> >>>> >>>> Thanks, >>>> Soumya >>>> >>>>> Alessandro >>>>> >>>>>> Il giorno 08/giu/2015, alle ore 20:00, Alessandro De Salvo >>>>>> <Alessandro.DeSalvo at roma1.infn.it >>>>>> <mailto:Alessandro.DeSalvo at roma1.infn.it>> ha scritto: >>>>>> >>>>>> Hi, >>>>>> indeed, it does not work :-) >>>>>> OK, this is what I did, with 2 machines, running CentOS 7.1, >>>>>> Glusterfs 3.7.1 and nfs-ganesha 2.2.0: >>>>>> >>>>>> 1) ensured that the machines are able to resolve their IPs (but >>>>>> this was already true since they were in the DNS); >>>>>> 2) disabled NetworkManager and enabled network on both machines; >>>>>> 3) created a gluster shared volume 'gluster_shared_storage' and >>>>>> mounted it on '/run/gluster/shared_storage' on all the cluster >>>>>> nodes using glusterfs native mount (on CentOS 7.1 there is a link >>>>>> by default /var/run -> ../run) >>>>>> 4) created an empty /etc/ganesha/ganesha.conf; >>>>>> 5) installed pacemaker pcs resource-agents corosync on all cluster >>>>>> machines; >>>>>> 6) set the ?hacluster? 
user the same password on all machines; >>>>>> 7) pcs cluster auth <hostname> -u hacluster -p <pass> on all the >>>>>> nodes (on both nodes I issued the commands for both nodes) >>>>>> 8) IPv6 is configured by default on all nodes, although the >>>>>> infrastructure is not ready for IPv6 >>>>>> 9) enabled pcsd and started it on all nodes >>>>>> 10) populated /etc/ganesha/ganesha-ha.conf with the following >>>>>> contents, one per machine: >>>>>> >>>>>> >>>>>> ===> atlas-node1 >>>>>> # Name of the HA cluster created. >>>>>> HA_NAME="ATLAS_GANESHA_01" >>>>>> # The server from which you intend to mount >>>>>> # the shared volume. >>>>>> HA_VOL_SERVER=?atlas-node1" >>>>>> # The subset of nodes of the Gluster Trusted Pool >>>>>> # that forms the ganesha HA cluster. IP/Hostname >>>>>> # is specified. >>>>>> HA_CLUSTER_NODES=?atlas-node1,atlas-node2" >>>>>> # Virtual IPs of each of the nodes specified above. >>>>>> VIP_atlas-node1=?x.x.x.1" >>>>>> VIP_atlas-node2=?x.x.x.2" >>>>>> >>>>>> ===> atlas-node2 >>>>>> # Name of the HA cluster created. >>>>>> HA_NAME="ATLAS_GANESHA_01" >>>>>> # The server from which you intend to mount >>>>>> # the shared volume. >>>>>> HA_VOL_SERVER=?atlas-node2" >>>>>> # The subset of nodes of the Gluster Trusted Pool >>>>>> # that forms the ganesha HA cluster. IP/Hostname >>>>>> # is specified. >>>>>> HA_CLUSTER_NODES=?atlas-node1,atlas-node2" >>>>>> # Virtual IPs of each of the nodes specified above. >>>>>> VIP_atlas-node1=?x.x.x.1" >>>>>> VIP_atlas-node2=?x.x.x.2? >>>>>> >>>>>> 11) issued gluster nfs-ganesha enable, but it fails with a cryptic >>>>>> message: >>>>>> >>>>>> # gluster nfs-ganesha enable >>>>>> Enabling NFS-Ganesha requires Gluster-NFS to be disabled across the >>>>>> trusted pool. Do you still want to continue? (y/n) y >>>>>> nfs-ganesha: failed: Failed to set up HA config for NFS-Ganesha. >>>>>> Please check the log file for details >>>>>> >>>>>> Looking at the logs I found nothing really special but this: >>>>>> >>>>>> ==> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log <=>>>>>> [2015-06-08 17:57:15.672844] I [MSGID: 106132] >>>>>> [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs >>>>>> already stopped >>>>>> [2015-06-08 17:57:15.675395] I >>>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host >>>>>> found Hostname is atlas-node2 >>>>>> [2015-06-08 17:57:15.720692] I >>>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host >>>>>> found Hostname is atlas-node2 >>>>>> [2015-06-08 17:57:15.721161] I >>>>>> [glusterd-ganesha.c:335:is_ganesha_host] 0-management: ganesha host >>>>>> found Hostname is atlas-node2 >>>>>> [2015-06-08 17:57:16.633048] E >>>>>> [glusterd-ganesha.c:254:glusterd_op_set_ganesha] 0-management: >>>>>> Initial NFS-Ganesha set up failed >>>>>> [2015-06-08 17:57:16.641563] E >>>>>> [glusterd-syncop.c:1396:gd_commit_op_phase] 0-management: Commit of >>>>>> operation 'Volume (null)' failed on localhost : Failed to set up HA >>>>>> config for NFS-Ganesha. Please check the log file for details >>>>>> >>>>>> ==> /var/log/glusterfs/cmd_history.log <=>>>>>> [2015-06-08 17:57:16.643615] : nfs-ganesha enable : FAILED : >>>>>> Failed to set up HA config for NFS-Ganesha. Please check the log >>>>>> file for details >>>>>> >>>>>> ==> /var/log/glusterfs/cli.log <=>>>>>> [2015-06-08 17:57:16.643839] I [input.c:36:cli_batch] 0-: Exiting >>>>>> with: -1 >>>>>> >>>>>> >>>>>> Also, pcs seems to be fine for the auth part, although it obviously >>>>>> tells me the cluster is not running. 
>>>>>> >>>>>> I, [2015-06-08T19:57:16.305323 #7223] INFO -- : Running: >>>>>> /usr/sbin/corosync-cmapctl totem.cluster_name >>>>>> I, [2015-06-08T19:57:16.345457 #7223] INFO -- : Running: >>>>>> /usr/sbin/pcs cluster token-nodes >>>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET >>>>>> /remote/check_auth HTTP/1.1" 200 68 0.1919 >>>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET >>>>>> /remote/check_auth HTTP/1.1" 200 68 0.1920 >>>>>> atlas-node1.mydomain - - [08/Jun/2015:19:57:16 CEST] "GET >>>>>> /remote/check_auth HTTP/1.1" 200 68 >>>>>> - -> /remote/check_auth >>>>>> >>>>>> >>>>>> What am I doing wrong? >>>>>> Thanks, >>>>>> >>>>>> Alessandro >>>>>> >>>>>>> Il giorno 08/giu/2015, alle ore 19:30, Soumya Koduri >>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 06/08/2015 08:20 PM, Alessandro De Salvo wrote: >>>>>>>> Sorry, just another question: >>>>>>>> >>>>>>>> - in my installation of gluster 3.7.1 the command gluster >>>>>>>> features.ganesha enable does not work: >>>>>>>> >>>>>>>> # gluster features.ganesha enable >>>>>>>> unrecognized word: features.ganesha (position 0) >>>>>>>> >>>>>>>> Which version has full support for it? >>>>>>> >>>>>>> Sorry. This option has recently been changed. It is now >>>>>>> >>>>>>> $ gluster nfs-ganesha enable >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> - in the documentation the ccs and cman packages are required, >>>>>>>> but they seems not to be available anymore on CentOS 7 and >>>>>>>> similar, I guess they are not really required anymore, as pcs >>>>>>>> should do the full job >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Alessandro >>>>>>> >>>>>>> Looks like so from http://clusterlabs.org/quickstart-redhat.html. >>>>>>> Let us know if it doesn't work. >>>>>>> >>>>>>> Thanks, >>>>>>> Soumya >>>>>>> >>>>>>>> >>>>>>>>> Il giorno 08/giu/2015, alle ore 15:09, Alessandro De Salvo >>>>>>>>> <alessandro.desalvo at roma1.infn.it >>>>>>>>> <mailto:alessandro.desalvo at roma1.infn.it>> ha scritto: >>>>>>>>> >>>>>>>>> Great, many thanks Soumya! >>>>>>>>> Cheers, >>>>>>>>> >>>>>>>>> Alessandro >>>>>>>>> >>>>>>>>>> Il giorno 08/giu/2015, alle ore 13:53, Soumya Koduri >>>>>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> Please find the slides of the demo video at [1] >>>>>>>>>> >>>>>>>>>> We recommend to have a distributed replica volume as a shared >>>>>>>>>> volume for better data-availability. >>>>>>>>>> >>>>>>>>>> Size of the volume depends on the workload you may have. Since >>>>>>>>>> it is used to maintain states of NLM/NFSv4 clients, you may >>>>>>>>>> calculate the size of the volume to be minimum of aggregate of >>>>>>>>>> (typical_size_of'/var/lib/nfs'_directory + >>>>>>>>>> ~4k*no_of_clients_connected_to_each_of_the_nfs_servers_at_any_point) >>>>>>>>>> >>>>>>>>>> We shall document about this feature sooner in the gluster docs >>>>>>>>>> as well. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Soumya >>>>>>>>>> >>>>>>>>>> [1] - http://www.slideshare.net/SoumyaKoduri/high-49117846 >>>>>>>>>> >>>>>>>>>> On 06/08/2015 04:34 PM, Alessandro De Salvo wrote: >>>>>>>>>>> Hi, >>>>>>>>>>> I have seen the demo video on ganesha HA, >>>>>>>>>>> https://www.youtube.com/watch?v=Z4mvTQC-efM >>>>>>>>>>> However there is no advice on the appropriate size of the >>>>>>>>>>> shared volume. How is it really used, and what should be a >>>>>>>>>>> reasonable size for it? 
>>>>>>>>>>> Also, are the slides from the video available somewhere, as >>>>>>>>>>> well as a documentation on all this? I did not manage to find >>>>>>>>>>> them. >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> Alessandro
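For anyone who wants to reproduce a packet trace like the one above, a minimal capture recipe on the server (a sketch; the interface, the output path and the <server> placeholder are arbitrary, and the mountd port is looked up from the portmapper rather than hard-coded):

# 1) find the TCP port that mountd (program 100005) registered with the portmapper
MOUNTD_PORT=$(rpcinfo -p localhost | awk '$5 == "mountd" && $3 == "tcp" {print $4; exit}')
# 2) capture portmapper + mountd traffic; stop with Ctrl-C once showmount times out
tcpdump -i any -nn -s 0 "port 111 or port ${MOUNTD_PORT}" -w /tmp/showmount.pcap
# 3) in another shell (or from a client), reproduce the hang
showmount -e <server>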
Alessandro De Salvo
2015-Jun-09 12:25 UTC
[Gluster-users] Questions on ganesha HA and shared storage size
OK, I can confirm that the ganesha.nsfd process is actually not answering to the calls. Here it is what I see: # rpcinfo -p program vers proto port service 100000 4 tcp 111 portmapper 100000 3 tcp 111 portmapper 100000 2 tcp 111 portmapper 100000 4 udp 111 portmapper 100000 3 udp 111 portmapper 100000 2 udp 111 portmapper 100024 1 udp 41594 status 100024 1 tcp 53631 status 100003 3 udp 2049 nfs 100003 3 tcp 2049 nfs 100003 4 udp 2049 nfs 100003 4 tcp 2049 nfs 100005 1 udp 58127 mountd 100005 1 tcp 56301 mountd 100005 3 udp 58127 mountd 100005 3 tcp 56301 mountd 100021 4 udp 46203 nlockmgr 100021 4 tcp 41798 nlockmgr 100011 1 udp 875 rquotad 100011 1 tcp 875 rquotad 100011 2 udp 875 rquotad 100011 2 tcp 875 rquotad # netstat -lpn | grep ganesha tcp6 14 0 :::2049 :::* LISTEN 11937/ganesha.nfsd tcp6 0 0 :::41798 :::* LISTEN 11937/ganesha.nfsd tcp6 0 0 :::875 :::* LISTEN 11937/ganesha.nfsd tcp6 10 0 :::56301 :::* LISTEN 11937/ganesha.nfsd tcp6 0 0 :::564 :::* LISTEN 11937/ganesha.nfsd udp6 0 0 :::2049 :::* 11937/ganesha.nfsd udp6 0 0 :::46203 :::* 11937/ganesha.nfsd udp6 0 0 :::58127 :::* 11937/ganesha.nfsd udp6 0 0 :::875 :::* 11937/ganesha.nfsd I'm attaching the strace of a showmount from a node to the other. This machinery was working with nfs-ganesha 2.1.0, so it must be something introduced with 2.2.0. Cheers, Alessandro On Tue, 2015-06-09 at 15:16 +0530, Soumya Koduri wrote:> > On 06/09/2015 02:48 PM, Alessandro De Salvo wrote: > > Hi, > > OK, the problem with the VIPs not starting is due to the ganesha_mon > > heartbeat script looking for a pid file called > > /var/run/ganesha.nfsd.pid, while by default ganesha.nfsd v.2.2.0 is > > creating /var/run/ganesha.pid, this needs to be corrected. The file is > > in glusterfs-ganesha-3.7.1-1.el7.x86_64, in my case. > > For the moment I have created a symlink in this way and it works: > > > > ln -s /var/run/ganesha.pid /var/run/ganesha.nfsd.pid > > > Thanks. Please update this as well in the bug. > > > So far so good, the VIPs are up and pingable, but still there is the > > problem of the hanging showmount (i.e. hanging RPC). > > Still, I see a lot of errors like this in /var/log/messages: > > > > Jun 9 11:15:20 atlas-node1 lrmd[31221]: notice: operation_finished: > > nfs-mon_monitor_10000:29292:stderr [ Error: Resource does not exist. ] > > > > While ganesha.log shows the server is not in grace: > > > > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29964[main] main :MAIN :EVENT :ganesha.nfsd Starting: > > Ganesha Version /builddir/build/BUILD/nfs-ganesha-2.2.0/src, built at > > May 18 2015 14:17:18 on buildhw-09.phx2.fedoraproject.org > > <http://buildhw-09.phx2.fedoraproject.org> > > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_set_param_from_conf :NFS STARTUP :EVENT > > :Configuration file successfully parsed > > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT > > :Initializing ID Mapper. > > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper > > successfully initialized. > > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] main :NFS STARTUP :WARN :No export entries > > found in configuration file !!! 
> > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] config_errs_to_log :CONFIG :WARN :Config File > > ((null):0): Empty configuration file > > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT > > :CAP_SYS_RESOURCE was successfully removed for proper quota management > > in FSAL > > 09/06/2015 11:16:20 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] lower_my_caps :NFS STARTUP :EVENT :currenty set > > capabilities are: > > cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep > > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_Init_svc :DISP :CRIT :Cannot acquire > > credentials for principal nfs > > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin > > thread initialized > > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs4_start_grace :STATE :EVENT :NFS Server Now > > IN GRACE, duration 60 > > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT > > :Callback creds directory (/var/run/ganesha) already exists > > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN > > :gssd_refresh_krb5_machine_credential failed (2:2) > > 09/06/2015 11:16:21 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :Starting > > delayed executor. 
> > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :9P/TCP > > dispatcher thread was started successfully > > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[_9p_disp] _9p_dispatcher_thread :9P DISP :EVENT :9P > > dispatcher started > > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT > > :gsh_dbusthread was started successfully > > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :admin thread > > was started successfully > > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :reaper thread > > was started successfully > > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN > > GRACE > > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_Start_threads :THREAD :EVENT :General > > fridge was started successfully > > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT > > :------------------------------------------------- > > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT : NFS > > SERVER INITIALIZED > > 09/06/2015 11:16:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[main] nfs_start :NFS STARTUP :EVENT > > :------------------------------------------------- > > 09/06/2015 11:17:22 : epoch 5576aee4 : atlas-node1 : > > ganesha.nfsd-29965[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now > > NOT IN GRACE > > > > > Please check the status of nfs-ganesha > $service nfs-ganesha status > > Could you try taking a packet trace (during showmount or mount) and > check the server responses. > > Thanks, > Soumya > > > Cheers, > > > > Alessandro > > > > > >> Il giorno 09/giu/2015, alle ore 10:36, Alessandro De Salvo > >> <alessandro.desalvo at roma1.infn.it > >> <mailto:alessandro.desalvo at roma1.infn.it>> ha scritto: > >> > >> Hi Soumya, > >> > >>> Il giorno 09/giu/2015, alle ore 08:06, Soumya Koduri > >>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> ha scritto: > >>> > >>> > >>> > >>> On 06/09/2015 01:31 AM, Alessandro De Salvo wrote: > >>>> OK, I found at least one of the bugs. > >>>> The /usr/libexec/ganesha/ganesha.sh has the following lines: > >>>> > >>>> if [ -e /etc/os-release ]; then > >>>> RHEL6_PCS_CNAME_OPTION="" > >>>> fi > >>>> > >>>> This is OK for RHEL < 7, but does not work for >= 7. I have changed > >>>> it to the following, to make it working: > >>>> > >>>> if [ -e /etc/os-release ]; then > >>>> eval $(grep -F "REDHAT_SUPPORT_PRODUCT=" /etc/os-release) > >>>> [ "$REDHAT_SUPPORT_PRODUCT" == "Fedora" ] && > >>>> RHEL6_PCS_CNAME_OPTION="" > >>>> fi > >>>> > >>> Oh..Thanks for the fix. Could you please file a bug for the same (and > >>> probably submit your fix as well). We shall have it corrected. > >> > >> Just did it,https://bugzilla.redhat.com/show_bug.cgi?id=1229601 > >> > >>> > >>>> Apart from that, the VIP_<node> I was using were wrong, and I should > >>>> have converted all the ?-? to underscores, maybe this could be > >>>> mentioned in the documentation when you will have it ready. > >>>> Now, the cluster starts, but the VIPs apparently not: > >>>> > >>> Sure. Thanks again for pointing it out. We shall make a note of it. 
> >>> > >>>> Online: [ atlas-node1 atlas-node2 ] > >>>> > >>>> Full list of resources: > >>>> > >>>> Clone Set: nfs-mon-clone [nfs-mon] > >>>> Started: [ atlas-node1 atlas-node2 ] > >>>> Clone Set: nfs-grace-clone [nfs-grace] > >>>> Started: [ atlas-node1 atlas-node2 ] > >>>> atlas-node1-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped > >>>> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 > >>>> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped > >>>> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 > >>>> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1 > >>>> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2 > >>>> > >>>> PCSD Status: > >>>> atlas-node1: Online > >>>> atlas-node2: Online > >>>> > >>>> Daemon Status: > >>>> corosync: active/disabled > >>>> pacemaker: active/disabled > >>>> pcsd: active/enabled > >>>> > >>>> > >>> Here corosync and pacemaker shows 'disabled' state. Can you check the > >>> status of their services. They should be running prior to cluster > >>> creation. We need to include that step in document as well. > >> > >> Ah, OK, you?re right, I have added it to my puppet modules (we install > >> and configure ganesha via puppet, I?ll put the module on puppetforge > >> soon, in case anyone is interested). > >> > >>> > >>>> But the issue that is puzzling me more is the following: > >>>> > >>>> # showmount -e localhost > >>>> rpc mount export: RPC: Timed out > >>>> > >>>> And when I try to enable the ganesha exports on a volume I get this > >>>> error: > >>>> > >>>> # gluster volume set atlas-home-01 ganesha.enable on > >>>> volume set: failed: Failed to create NFS-Ganesha export config file. > >>>> > >>>> But I see the file created in /etc/ganesha/exports/*.conf > >>>> Still, showmount hangs and times out. > >>>> Any help? > >>>> Thanks, > >>>> > >>> Hmm that's strange. Sometimes, in case if there was no proper cleanup > >>> done while trying to re-create the cluster, we have seen such issues. > >>> > >>> https://bugzilla.redhat.com/show_bug.cgi?id=1227709 > >>> > >>> http://review.gluster.org/#/c/11093/ > >>> > >>> Can you please unexport all the volumes, teardown the cluster using > >>> 'gluster vol set <volname> ganesha.enable off? > >> > >> OK: > >> > >> # gluster vol set atlas-home-01 ganesha.enable off > >> volume set: failed: ganesha.enable is already 'off'. > >> > >> # gluster vol set atlas-data-01 ganesha.enable off > >> volume set: failed: ganesha.enable is already 'off'. > >> > >> > >>> 'gluster ganesha disable' command. > >> > >> I?m assuming you wanted to write nfs-ganesha instead? > >> > >> # gluster nfs-ganesha disable > >> ganesha enable : success > >> > >> > >> A side note (not really important): it?s strange that when I do a > >> disable the message is ?ganesha enable? :-) > >> > >>> > >>> Verify if the following files have been deleted on all the nodes- > >>> '/etc/cluster/cluster.conf? > >> > >> this file is not present at all, I think it?s not needed in CentOS 7 > >> > >>> '/etc/ganesha/ganesha.conf?, > >> > >> it?s still there, but empty, and I guess it should be OK, right? > >> > >>> '/etc/ganesha/exports/*? > >> > >> no more files there > >> > >>> '/var/lib/pacemaker/cib? > >> > >> it?s empty > >> > >>> > >>> Verify if the ganesha service is stopped on all the nodes. > >> > >> nope, it?s still running, I will stop it. > >> > >>> > >>> start/restart the services - corosync, pcs. 
> >>
> >> In the node where I issued the nfs-ganesha disable there is no longer
> >> any /etc/corosync/corosync.conf, so corosync won't start. The other
> >> node instead still has the file, which is strange.
> >>
> >>>
> >>> And re-try the HA cluster creation
> >>> 'gluster ganesha enable'
> >>
> >> This time (repeated twice) it did not work at all:
> >>
> >> # pcs status
> >> Cluster name: ATLAS_GANESHA_01
> >> Last updated: Tue Jun 9 10:13:43 2015
> >> Last change: Tue Jun 9 10:13:22 2015
> >> Stack: corosync
> >> Current DC: atlas-node1 (1) - partition with quorum
> >> Version: 1.1.12-a14efad
> >> 2 Nodes configured
> >> 6 Resources configured
> >>
> >> Online: [ atlas-node1 atlas-node2 ]
> >>
> >> Full list of resources:
> >>
> >> Clone Set: nfs-mon-clone [nfs-mon]
> >>     Started: [ atlas-node1 atlas-node2 ]
> >> Clone Set: nfs-grace-clone [nfs-grace]
> >>     Started: [ atlas-node1 atlas-node2 ]
> >> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1
> >> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2
> >>
> >> PCSD Status:
> >>     atlas-node1: Online
> >>     atlas-node2: Online
> >>
> >> Daemon Status:
> >>     corosync: active/enabled
> >>     pacemaker: active/enabled
> >>     pcsd: active/enabled
> >>
> >> I then tried "pcs cluster destroy" on both nodes, and then again
> >> nfs-ganesha enable, but now I'm back to the old problem:
> >>
> >> # pcs status
> >> Cluster name: ATLAS_GANESHA_01
> >> Last updated: Tue Jun 9 10:22:27 2015
> >> Last change: Tue Jun 9 10:17:00 2015
> >> Stack: corosync
> >> Current DC: atlas-node2 (2) - partition with quorum
> >> Version: 1.1.12-a14efad
> >> 2 Nodes configured
> >> 10 Resources configured
> >>
> >> Online: [ atlas-node1 atlas-node2 ]
> >>
> >> Full list of resources:
> >>
> >> Clone Set: nfs-mon-clone [nfs-mon]
> >>     Started: [ atlas-node1 atlas-node2 ]
> >> Clone Set: nfs-grace-clone [nfs-grace]
> >>     Started: [ atlas-node1 atlas-node2 ]
> >> atlas-node1-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
> >> atlas-node1-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1
> >> atlas-node2-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
> >> atlas-node2-trigger_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2
> >> atlas-node1-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node1
> >> atlas-node2-dead_ip-1 (ocf::heartbeat:Dummy): Started atlas-node2
> >>
> >> PCSD Status:
> >>     atlas-node1: Online
> >>     atlas-node2: Online
> >>
> >> Daemon Status:
> >>     corosync: active/enabled
> >>     pacemaker: active/enabled
> >>     pcsd: active/enabled
> >>
> >> Cheers,
> >>
> >>     Alessandro
> >>
> >>>
> >>> Thanks,
> >>> Soumya
> >>>
> >>>>     Alessandro
> >>>>
> >>>>> On 08/Jun/2015, at 20:00, Alessandro De Salvo
> >>>>> <Alessandro.DeSalvo at roma1.infn.it <mailto:Alessandro.DeSalvo at roma1.infn.it>> wrote:
> >>>>>
> >>>>> Hi,
> >>>>> indeed, it does not work :-)
> >>>>> OK, this is what I did, with 2 machines running CentOS 7.1,
> >>>>> GlusterFS 3.7.1 and nfs-ganesha 2.2.0:
> >>>>>
> >>>>> 1) ensured that the machines are able to resolve their IPs (but
> >>>>>    this was already true since they were in the DNS);
> >>>>> 2) disabled NetworkManager and enabled network on both machines;
> >>>>> 3) created a gluster shared volume 'gluster_shared_storage' and
> >>>>>    mounted it on '/run/gluster/shared_storage' on all the cluster
> >>>>>    nodes using the glusterfs native mount (on CentOS 7.1 there is
> >>>>>    by default a link /var/run -> ../run);
> >>>>> 4) created an empty /etc/ganesha/ganesha.conf;
> >>>>> 5) installed pacemaker, pcs, resource-agents and corosync on all
> >>>>>    cluster machines;
> >>>>> 6) set the 'hacluster' user the same password on all machines;
> >>>>> 7) pcs cluster auth <hostname> -u hacluster -p <pass> on all the
> >>>>>    nodes (on both nodes I issued the commands for both nodes)
> >>>>> 8) IPv6 is configured by default on all nodes, although the
> >>>>>    infrastructure is not ready for IPv6
> >>>>> 9) enabled pcsd and started it on all nodes
> >>>>> 10) populated /etc/ganesha/ganesha-ha.conf with the following
> >>>>>     contents, one per machine:
> >>>>>
> >>>>> ===> atlas-node1
> >>>>> # Name of the HA cluster created.
> >>>>> HA_NAME="ATLAS_GANESHA_01"
> >>>>> # The server from which you intend to mount
> >>>>> # the shared volume.
> >>>>> HA_VOL_SERVER="atlas-node1"
> >>>>> # The subset of nodes of the Gluster Trusted Pool
> >>>>> # that forms the ganesha HA cluster. IP/Hostname
> >>>>> # is specified.
> >>>>> HA_CLUSTER_NODES="atlas-node1,atlas-node2"
> >>>>> # Virtual IPs of each of the nodes specified above.
> >>>>> VIP_atlas-node1="x.x.x.1"
> >>>>> VIP_atlas-node2="x.x.x.2"
> >>>>>
> >>>>> ===> atlas-node2
> >>>>> # Name of the HA cluster created.
> >>>>> HA_NAME="ATLAS_GANESHA_01"
> >>>>> # The server from which you intend to mount
> >>>>> # the shared volume.
> >>>>> HA_VOL_SERVER="atlas-node2"
> >>>>> # The subset of nodes of the Gluster Trusted Pool
> >>>>> # that forms the ganesha HA cluster. IP/Hostname
> >>>>> # is specified.
> >>>>> HA_CLUSTER_NODES="atlas-node1,atlas-node2"
> >>>>> # Virtual IPs of each of the nodes specified above.
> >>>>> VIP_atlas-node1="x.x.x.1"
> >>>>> VIP_atlas-node2="x.x.x.2"
> >>>>>
> >>>>> 11) issued gluster nfs-ganesha enable, but it fails with a cryptic
> >>>>>     message:
> >>>>>
> >>>>> # gluster nfs-ganesha enable
> >>>>> Enabling NFS-Ganesha requires Gluster-NFS to be disabled across the
> >>>>> trusted pool. Do you still want to continue? (y/n) y
> >>>>> nfs-ganesha: failed: Failed to set up HA config for NFS-Ganesha.
> >>>>> Please check the log file for details
> >>>>>
> >>>>> Looking at the logs I found nothing really special but this:
> >>>>>
> >>>>> ==> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log <==
> >>>>> [2015-06-08 17:57:15.672844] I [MSGID: 106132]
> >>>>> [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs
> >>>>> already stopped
> >>>>> [2015-06-08 17:57:15.675395] I
> >>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host
> >>>>> found Hostname is atlas-node2
> >>>>> [2015-06-08 17:57:15.720692] I
> >>>>> [glusterd-ganesha.c:386:check_host_list] 0-management: ganesha host
> >>>>> found Hostname is atlas-node2
> >>>>> [2015-06-08 17:57:15.721161] I
> >>>>> [glusterd-ganesha.c:335:is_ganesha_host] 0-management: ganesha host
> >>>>> found Hostname is atlas-node2
> >>>>> [2015-06-08 17:57:16.633048] E
> >>>>> [glusterd-ganesha.c:254:glusterd_op_set_ganesha] 0-management:
> >>>>> Initial NFS-Ganesha set up failed
> >>>>> [2015-06-08 17:57:16.641563] E
> >>>>> [glusterd-syncop.c:1396:gd_commit_op_phase] 0-management: Commit of
> >>>>> operation 'Volume (null)' failed on localhost : Failed to set up HA
> >>>>> config for NFS-Ganesha. Please check the log file for details
> >>>>>
> >>>>> ==> /var/log/glusterfs/cmd_history.log <==
> >>>>> [2015-06-08 17:57:16.643615] : nfs-ganesha enable : FAILED :
> >>>>> Failed to set up HA config for NFS-Ganesha. Please check the log
> >>>>> file for details
> >>>>>
> >>>>> ==> /var/log/glusterfs/cli.log <==
> >>>>> [2015-06-08 17:57:16.643839] I [input.c:36:cli_batch] 0-: Exiting
> >>>>> with: -1
> >>>>>
> >>>>> Also, pcs seems to be fine for the auth part, although it obviously
> >>>>> tells me the cluster is not running.
> >>>>>
> >>>>> I, [2015-06-08T19:57:16.305323 #7223] INFO -- : Running:
> >>>>> /usr/sbin/corosync-cmapctl totem.cluster_name
> >>>>> I, [2015-06-08T19:57:16.345457 #7223] INFO -- : Running:
> >>>>> /usr/sbin/pcs cluster token-nodes
> >>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET
> >>>>> /remote/check_auth HTTP/1.1" 200 68 0.1919
> >>>>> ::ffff:141.108.38.46 - - [08/Jun/2015 19:57:16] "GET
> >>>>> /remote/check_auth HTTP/1.1" 200 68 0.1920
> >>>>> atlas-node1.mydomain - - [08/Jun/2015:19:57:16 CEST] "GET
> >>>>> /remote/check_auth HTTP/1.1" 200 68
> >>>>> - -> /remote/check_auth
> >>>>>
> >>>>> What am I doing wrong?
> >>>>> Thanks,
> >>>>>
> >>>>>     Alessandro
> >>>>>
> >>>>>> On 08/Jun/2015, at 19:30, Soumya Koduri
> >>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> wrote:
> >>>>>>
> >>>>>> On 06/08/2015 08:20 PM, Alessandro De Salvo wrote:
> >>>>>>> Sorry, just another question:
> >>>>>>>
> >>>>>>> - in my installation of gluster 3.7.1 the command gluster
> >>>>>>> features.ganesha enable does not work:
> >>>>>>>
> >>>>>>> # gluster features.ganesha enable
> >>>>>>> unrecognized word: features.ganesha (position 0)
> >>>>>>>
> >>>>>>> Which version has full support for it?
> >>>>>>
> >>>>>> Sorry. This option has recently been changed. It is now
> >>>>>>
> >>>>>> $ gluster nfs-ganesha enable
> >>>>>>
> >>>>>>>
> >>>>>>> - in the documentation the ccs and cman packages are required,
> >>>>>>> but they seem not to be available anymore on CentOS 7 and
> >>>>>>> similar; I guess they are not really required anymore, as pcs
> >>>>>>> should do the full job
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>>     Alessandro
> >>>>>>
> >>>>>> Looks like so from http://clusterlabs.org/quickstart-redhat.html.
> >>>>>> Let us know if it doesn't work.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Soumya
> >>>>>>
> >>>>>>>
> >>>>>>>> On 08/Jun/2015, at 15:09, Alessandro De Salvo
> >>>>>>>> <alessandro.desalvo at roma1.infn.it <mailto:alessandro.desalvo at roma1.infn.it>> wrote:
> >>>>>>>>
> >>>>>>>> Great, many thanks Soumya!
> >>>>>>>> Cheers,
> >>>>>>>>
> >>>>>>>>     Alessandro
> >>>>>>>>
> >>>>>>>>> On 08/Jun/2015, at 13:53, Soumya Koduri
> >>>>>>>>> <skoduri at redhat.com <mailto:skoduri at redhat.com>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Please find the slides of the demo video at [1]
> >>>>>>>>>
> >>>>>>>>> We recommend a distributed replicated volume as the shared
> >>>>>>>>> volume, for better data availability.
> >>>>>>>>>
> >>>>>>>>> The size of the volume depends on the workload you may have. Since
> >>>>>>>>> it is used to maintain the states of NLM/NFSv4 clients, you may
> >>>>>>>>> calculate the minimum size of the volume as the aggregate of
> >>>>>>>>> (typical_size_of_'/var/lib/nfs'_directory +
> >>>>>>>>> ~4k * no_of_clients_connected_to_each_of_the_nfs_servers_at_any_point)
> >>>>>>>>>
> >>>>>>>>> We shall document this feature soon in the gluster docs
> >>>>>>>>> as well.
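To put rough numbers on that formula: /var/lib/nfs is normally only a few megabytes, so with, say, two ganesha servers and 500 clients each (purely illustrative figures), the per-server state is a few MB plus 4 KB * 500 = about 2 MB, i.e. well under 10 MB per server and only tens of MB for the whole cluster. In other words, even a very small shared volume is ample in terms of capacity; its replication level matters far more than its size.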
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Soumya
> >>>>>>>>>
> >>>>>>>>> [1] - http://www.slideshare.net/SoumyaKoduri/high-49117846
> >>>>>>>>>
> >>>>>>>>> On 06/08/2015 04:34 PM, Alessandro De Salvo wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>> I have seen the demo video on ganesha HA,
> >>>>>>>>>> https://www.youtube.com/watch?v=Z4mvTQC-efM
> >>>>>>>>>> However, there is no advice on the appropriate size of the
> >>>>>>>>>> shared volume. How is it really used, and what should be a
> >>>>>>>>>> reasonable size for it?
> >>>>>>>>>> Also, are the slides from the video available somewhere, as
> >>>>>>>>>> well as documentation on all this? I did not manage to find
> >>>>>>>>>> them.
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>>     Alessandro
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> Gluster-users mailing list
> >>>>>>>>>> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org>
> >>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
> >>>>>>>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Gluster-users mailing list
> >>>>> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org>
> >>>>> http://www.gluster.org/mailman/listinfo/gluster-users
> >>
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org>
> >> http://www.gluster.org/mailman/listinfo/gluster-users
> >
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strace-showmount.log.gz
Type: application/gzip
Size: 3388 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20150609/3fea410b/attachment.bin>
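For completeness, creating the replicated shared volume recommended earlier in the thread and mounting it on the path used above can be sketched roughly as follows; the brick paths are illustrative only, and with just two nodes a two-way replica is the practical minimum (three-way replication is safer against split-brain where a third node is available).

  # gluster volume create gluster_shared_storage replica 2 \
        atlas-node1:/bricks/shared/brick atlas-node2:/bricks/shared/brick
  # gluster volume start gluster_shared_storage
  # mount -t glusterfs localhost:/gluster_shared_storage /run/gluster/shared_storage

An fstab entry (or an automated mount via puppet, as mentioned earlier in the thread) keeps the shared state available across reboots on every node of the HA cluster.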