We are getting:

Dec 4 17:19:41 web2 kernel: [9724159.177875] EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
Dec 4 17:19:41 web2 kernel: [9724159.463691] VMware hgfs: HGFS is disabled in the host
Dec 4 17:19:41 web2 kernel: [9724160.965637] OCFS2 Node Manager 1.3.3
Dec 4 17:19:41 web2 kernel: [9724161.033122] OCFS2 DLM 1.3.3
Dec 4 17:19:41 web2 kernel: [9724161.037686] OCFS2 DLMFS 1.3.3
Dec 4 17:19:41 web2 kernel: [9724161.038842] OCFS2 User DLM kernel interface loaded
Dec 4 17:19:41 web2 kernel: [9724171.616652] o2net: accepted connection from node rgapp1 (num 4) at 192.168.102.11:7777
Dec 4 17:19:41 web2 kernel: [9724171.722162] OCFS2 1.3.3
Dec 4 17:19:41 web2 kernel: [9724171.782112] ocfs2_dlm: Nodes in domain ("7D876A4B2EE14D0C8E1181E8DCF4237B"): 2
Dec 4 17:19:41 web2 kernel: [9724171.782345] ocfs2_dlm: Node 4 joins domain 7D876A4B2EE14D0C8E1181E8DCF4237B
Dec 4 17:19:41 web2 kernel: [9724171.782348] ocfs2_dlm: Nodes in domain ("7D876A4B2EE14D0C8E1181E8DCF4237B"): 2 4
Dec 4 17:19:41 web2 kernel: [9724171.782758] (4262,0):ocfs2_find_slot:268 slot 2 is already allocated to this node!
Dec 4 17:19:41 web2 kernel: [9724171.841264] (4262,0):ocfs2_check_volume:1662 File system was not unmounted cleanly, recovering volume.
Dec 4 17:19:41 web2 kernel: [9724171.841830] kjournald starting. Commit interval 5 seconds
Dec 4 17:19:41 web2 kernel: [9724171.880229] ocfs2: Mounting device (8,17) on (node 2, slot 2) with ordered data mode.
Dec 4 17:19:43 web2 kernel: [9724175.991919] o2net: accepted connection from node app1 (num 6) at 192.168.102.10:7777
Dec 4 17:19:45 web2 kernel: [9724178.086781] VMware memory control driver initialized
Dec 4 17:19:46 web2 kernel: [9724178.235647] o2net: accepted connection from node deploy (num 5) at 192.168.102.12:7777
Dec 4 17:19:50 web2 kernel: [9724182.319762] ocfs2_dlm: Node 6 joins domain 7D876A4B2EE14D0C8E1181E8DCF4237B
Dec 4 17:19:50 web2 kernel: [9724182.319773] ocfs2_dlm: Nodes in domain ("7D876A4B2EE14D0C8E1181E8DCF4237B"): 2 4 6
Dec 4 17:19:50 web2 kernel: [9724182.598848] ocfs2_dlm: Node 5 joins domain 7D876A4B2EE14D0C8E1181E8DCF4237B
Dec 4 17:19:50 web2 kernel: [9724182.598853] ocfs2_dlm: Nodes in domain ("7D876A4B2EE14D0C8E1181E8DCF4237B"): 2 4 5 6
Dec 4 17:21:32 web2 syslogd 1.5.0#1ubuntu1: restart.

This completely froze the entire cluster when ESX tried to VMotion 3 of the 6 nodes to a new host.

Does Oracle recommend not enabling DRS on virtual machines that use the cluster, or is there a configuration we can use to keep crashes like this from happening all the time?

I have seen several posts suggesting that disabling DRS would be a "way to work around" this issue, but that is not really good practice, as you would lose a lot of your HA abilities.

Also, is there a way to have OCFS2 drop a node from the cluster if a new node comes online with its ID?

David Murphy
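PS: One configuration I keep seeing suggested for riding out the brief stall a VMotion causes is raising the o2cb timeouts, e.g. in /etc/default/o2cb (/etc/sysconfig/o2cb on RHEL-style boxes). The values below are only examples, and I am not sure which of these knobs exist on the 1.3 kernel module versus the 1.4 tools:

    # Node is declared dead after (threshold - 1) * 2 seconds
    # of missed disk heartbeats, so 61 allows roughly 120 seconds.
    O2CB_HEARTBEAT_THRESHOLD=61
    # Network idle timeout in ms (configurable with 1.4 tools).
    O2CB_IDLE_TIMEOUT_MS=60000

followed by a restart of o2cb (/etc/init.d/o2cb restart) with the volumes unmounted. Whether that alone is enough to survive VMotioning 3 of 6 nodes at once, I do not know.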
What ID are you referring to? IP address? Hope not.

BTW, this is not 1.4.1.

David Murphy wrote:
> Dec 4 17:19:41 web2 kernel: [9724171.722162] OCFS2 1.3.3
> [...]
> Also, is there a way to have OCFS2 drop a node from the cluster if a
> new node comes online with its ID?
> (4262,0):ocfs2_find_slot:268 slot 2 is already allocated to this node!

Rather than ID, maybe I should have said node slot.

Also:

[root at web2 ~]# dpkg -l | grep ocfs
ii  ocfs2-tools  1.4.1-1  tools for managing OCFS2 cluster filesystems

David

-----Original Message-----
From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Friday, December 05, 2008 12:56 PM
To: David Murphy
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ESX 3.5 DRS and OCFS2 1.4.1-1

What ID are you referring to? IP address? Hope not.

BTW, this is not 1.4.1.

David Murphy wrote:
> [...]
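PS: I take it the "OCFS2 1.3.3" in the kernel log is the in-kernel module version, which is packaged separately from the tools. To double-check what the kernel side actually is (assuming the module reports a version string to modinfo):

    modinfo ocfs2 | grep -i version    # version of the loaded filesystem module
    uname -r                           # confirm which kernel we booted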
While the NFS protocol is stateless and thus should handle failing over, the procedures themselves are synchronous. Meaning, I am not sure how an NFS client will handle getting an OK for some metadata change (mkdir, etc.) just before a server dies and is recovered by another node. If the op did not make it to the journal, it would be a null op, but the NFS client would not know that, as the server has failed over. This is a question for NFS.

What is the stack of the NFS clients? As in, what are they waiting on?

Luis Freitas wrote:
> Hi list,
>
> I need to implement a highly available NFS server. Since we already have OCFS2 here for RAC, and already have a virtual IP on the RAC server that fails over automatically to the other node, it seems a natural choice to use it for our NFS needs too. We are using OCFS2 1.2. (Upgrading to 1.4 is not in our current plans.)
>
> We did a preliminary failover test, and the client that mounts the filesystem (actually a Solaris box) doesn't like the failover. We expect some errors and minor data loss and can tolerate them as a transient condition, but the problem is that the mounted filesystem on the client becomes useless until we umount and remount it again.
>
> I suspect that NFS uses inode numbers on the underlying filesystem to create "handles" that it passes on to clients, but I am not sure how this is done.
>
> Does anyone know if we can achieve a failover without needing to remount the NFS share on the clients? Are any special options needed when mounting the OCFS2 filesystem, when exporting it over NFS, or on the client?
>
> Best Regards,
> Luis
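PS: On a Linux client, something like the following would dump the task stacks so you can see what the processes are blocked on (sysrq has to be enabled; I am not sure what the Solaris equivalent is):

    echo t > /proc/sysrq-trigger    # dump all task stacks to the kernel log
    dmesg | less                    # then search for the hung nfs tasks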
Sunil,

They are not waiting; the kernel reconnects after a few seconds. But it just doesn't like the other NFS server: any attempt to access directories or files after the virtual IP failed over to the other NFS server resulted in errors. Unfortunately I don't have the exact error message here anymore.

We found a parameter on the NFS server that seems to fix it: fsid. If you set this to the same number on both servers, it forces both of them to use the same identifiers. It seems that if you don't, you need to guarantee that the mount is done on the same device on both servers, and we cannot do that since we are using PowerPath.

I would also like to confirm that the inode numbers are consistent across servers. That is:

[oracle at br001sv0440 concurrents]$ ls -il
total 8
131545 drwxr-xr-x 2 100  users 4096 Dec 9 12:12 admin
131543 drwxrwxrwx 2 root dba   4096 Dec 4 08:53 lost+found
[oracle at br001sv0440 concurrents]$

Is directory "admin" (or any other directory or file) always inode number 131545, no matter which server we are on? It seems to be so, but I would like to confirm.

About the metadata changes: this share will be used for log files (actually, for Oracle eBusiness Suite concurrent log and output files), so we can tolerate it if a few of the latest files are lost during the failover. The user can simply run his report again. It can also be tolerated if some processes hang or die during the failover, as the internal manager can restart them. Preferably processes should die instead of hanging.

But I am concerned about dangling locks on the server, and I am not sure how to handle those. In the NFS-HA docs some files in /var/lib/nfs are copied by scripts every few seconds, but this does not seem to be a foolproof approach.

I looked over the NFS-HA docs sent to the list. They are useful, but also very "Linux HA" centric, and require the heartbeat2 package. I won't install another cluster stack, since I already have CRS here. Does anyone have pointers on a similar setup with CRS?

Best Regards,
Luis

--- On Mon, 12/8/08, Sunil Mushran <sunil.mushran at oracle.com> wrote:

> From: Sunil Mushran <sunil.mushran at oracle.com>
> Subject: Re: [Ocfs2-users] NFS Failover
> To: lfreitas34 at yahoo.com
> Cc: ocfs2-users at oss.oracle.com
> Date: Monday, December 8, 2008, 11:47 PM
>
> While the nfs protocol is stateless and thus should handle failing-over,
> the procedures themselves are synchronous. [...]
>
> What is the stack of the nfs clients? As in, what are they waiting on?
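PS: For reference, this is the sort of export entry we are testing, in /etc/exports on both servers. The path and the fsid value here are just examples; the point is that fsid must be the same number on both servers:

    /u02/concurrents  *(rw,sync,no_subtree_check,fsid=101)

followed by exportfs -ra to apply it. And to spot-check that the inode numbers agree across servers:

    stat -c '%i %n' /u02/concurrents/admin    # run on each server; the numbers should match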