Do some tracing in evms. Why is it taking over 60 seconds to do the
I/O? That's the question you need answered. Also, double the heartbeat
threshold to 120 seconds. See if that helps.
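
For reference, the disk heartbeat timeout works out to roughly
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so the default of 31 gives the
60000 ms in your log and a value of 61 gives about 120 seconds. A minimal
sketch of the change on one node follows; the offline/online sequence is an
assumption about your setup, ocfs2 volumes should be unmounted first, and
every node needs the same value:

# /etc/sysconfig/o2cb -- raise the disk heartbeat dead threshold
# 61 iterations ~= (61 - 1) * 2 = 120 seconds
O2CB_HEARTBEAT_THRESHOLD=61

# apply on this node with all ocfs2 volumes unmounted
/etc/init.d/o2cb offline
/etc/init.d/o2cb online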
On Jan 18, 2010, at 11:42 AM, Angelo McComis <angelo at mccomis.com> wrote:
> One more follow on,
>
> The combination of kernel.panic=60 and kernel.printk=7 4 1 7 seems to
> have netted the culprit:
>
> E01-netconsole.log:Jan 18 09:45:10 E01 (10,0):o2hb_write_timeout:137 ERROR: Heartbeat write timeout to device dm-12 after 60000 milliseconds
> E01-netconsole.log:Jan 18 09:45:10 E01 (10,0):o2hb_stop_all_regions:1517 ERROR: stopping heartbeat on all active regions.
> E01-netconsole.log:Jan 18 09:45:10 E01 ocfs2 is very sorry to be fencing this system by restarting
>
> dm-12 maps to my evms volume...
>
> iostat for dm-12 doesn't indicate that it's overly taxed.
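>
> For what it's worth, per-request latency matters more to the heartbeat than
> raw throughput, and extended iostat output shows it. Something like the
> following (the device argument is just an example of the syntax):
>
> # sample extended stats for the heartbeat device every 5 seconds
> iostat -xk 5 dm-12
> # a multi-second spike in the await column during a stall would line
> # up with the 60000 ms heartbeat write timeout above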
>
> Can we get some ideas from the info provided?
>
> Thanks,
>
> Angelo
>
>
>
>
>
> On Mon, Jan 18, 2010 at 7:57 AM, Angelo McComis <angelo at mccomis.com>
> wrote:
>> Some updates from the problem we've been having...
>>
>> Thanks to Sunil for suggesting netconsole be turned on. We've enabled
>> netconsole on the ocfs2 cluster members, with them reporting logs to a
>> server on the same subnet that's outside of the cluster. The logs are
>> there, but nothing related to ocfs2 after the reboots. A case-insensitive
>> grep for o2hb, o2cb, ocfs2, etc. turns up nothing... Googling, I noted a
>> reference to sending
>> sysctl -w kernel.printk="7 4 1 7"
>>
>> but Novell's suggestions (syslog entries on the receiver side, and
>> /etc/modprobe.conf.local and /etc/sysconfig/kernel on the sending side)
>> were pretty generic.
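>>
>> In case comparing notes helps, here is a rough sketch of what that setup
>> can look like on SLES; every address, port, interface and MAC below is a
>> made-up placeholder, not a value from this cluster:
>>
>> # /etc/modprobe.conf.local -- parameter format per the kernel's
>> # Documentation/networking/netconsole.txt:
>> #   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
>> options netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55
>>
>> # /etc/sysconfig/kernel -- have the module loaded at boot
>> MODULES_LOADED_ON_BOOT="netconsole"
>>
>> # raise console logging now, and persist it in /etc/sysctl.conf as
>> # kernel.printk = 7 4 1 7
>> sysctl -w kernel.printk="7 4 1 7"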
>>
>> What we've done so far:
>>
>> - Mount options: added nointr, noatime, datavolume (removed "defaults")
>> - Multipath.conf: added one (we had been running without a multipath.conf,
>>   which means all dm- defaults); a generic sketch follows below
>> - O2CB_HEARTBEAT_THRESHOLD: set it to 76 (was running the default of 31)
>> - Turned on netconsole (but it's not telling us anything useful yet)
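>>
>> (A very generic multipath.conf sketch for illustration only; the settings
>> below are assumptions, and the right device section comes from the storage
>> vendor's recommendations for the array:)
>>
>> # /etc/multipath.conf -- generic illustration, not the production file
>> defaults {
>>         user_friendly_names yes
>>         polling_interval    10
>>         # bound how long I/O can hang on path loss so it stays well under
>>         # the heartbeat threshold; queueing forever can stall the
>>         # heartbeat write and trigger a fence
>>         no_path_retry       5
>>         failback            immediate
>> }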
>>
>> I know Sunil suggested that we can get to the bottom of the fencing once
>> and for all with the logging, but the above set of changes seemed "best
>> practice" enough to go ahead with, even without the specifics we might
>> learn from the logs.
>>
>> Once we pushed the above four items to our non-prod cluster, it
>> stabilized immediately. However, in another datacenter we have the same
>> setup (a six-node cluster for prod and a six-node non-prod cluster), and
>> it's not having the same problems at all, running all the defaults.
>> Saturday during our maintenance we pushed these changes to our prod
>> cluster and have seen no issues since.
>>
>> I tend to believe Sunil's assertion that this is storage-related, and
>> our storage environment is getting better all the time, but I'd really
>> like to understand this better before I tag them as the cause.
>>
>> We have backed out the "good" changes from non-prod in hopes we would
>> start catching log entries from ocfs2/o2hb/o2cb/etc. So far we've seen a
>> couple of fencing operations, but no helpful log entries yet.
>>
>> So, technically we have some stabilization, but still no
>> instrumentation around it.
>>
>> Any ideas what we're missing on netconsole to close the circle? I
>> believe we can get
>>
>> Angelo
>>
>> On Wed, Jan 13, 2010 at 3:46 PM, Sunil Mushran <sunil.mushran at
>> oracle.com> wrote:
>>> Do you have netconsole output? We have to determine the
>>> reason for the fencing before we can recommend any changes.
>>>
>>> Angelo McComis wrote:
>>>>
>>>> Some more about my setup, which started the discussion...
>>>>
>>>> Version info, mount options, etc. are herein.
>>>>
>>>> If there are recommended changes to this, I'm open to suggestions
>>>> here. This is mostly an "out of the box" configuration.
>>>>
>>>> We are not running Oracle DB, just using this as a shared place for
>>>> transaction files between application servers doing parallel
>>>> processing.
>>>>
>>>> So: do we want "datavolume, noatime" added to the mount options of
>>>> just _netdev and heartbeat=local? Will that help or hurt? Also, do we
>>>> want to turn up HEARTBEAT_THRESHOLD?
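>>>>
>>>> For concreteness, a sketch of how the fstab line could look with those
>>>> options added, using the device and mount point shown in the mount
>>>> output further down (whether datavolume is appropriate for non-database
>>>> files is part of the question):
>>>>
>>>> # /etc/fstab -- ocfs2 volume with the proposed options
>>>> /dev/evms/prod_app /opt/VendorApsp/sharedapp ocfs2 _netdev,heartbeat=local,nointr,noatime,datavolume 0 0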
>>>>
>>>>
>>>>
>>>> BEERGOGGLES1:~# modinfo ocfs2
>>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/ocfs2.ko
>>>> license: GPL
>>>> author: Oracle
>>>> version: 1.4.1-1-SLES
>>>> description: OCFS2 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>>>> srcversion: 986DD1EE4F5ABD8A44FF925
>>>> depends: ocfs2_dlm,jbd,ocfs2_nodemanager
>>>> supported: yes
>>>> vermagic: 2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>>
>>>> BEERGOGGLES1:~# modinfo ocfs2_dlm
>>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
>>>> license: GPL
>>>> author: Oracle
>>>> version: 1.4.1-1-SLES
>>>> description: OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>>>> srcversion: FDB660B2EB59EF106C6305F
>>>> depends: ocfs2_nodemanager
>>>> supported: yes
>>>> vermagic: 2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>> parm: dlm_purge_interval_ms:int
>>>> parm: dlm_purge_locks_max:int
>>>>
>>>> BEERGOGGLES1:~# modinfo jbd
>>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/jbd/jbd.ko
>>>> license: GPL
>>>> srcversion: DCCDE02902B83F98EF81090
>>>> depends:
>>>> supported: yes
>>>> vermagic: 2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>>
>>>> BEERGOGGLES1:~# modinfo ocfs2_nodemanager
>>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
>>>> license: GPL
>>>> author: Oracle
>>>> version: 1.4.1-1-SLES
>>>> description: OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>>>> srcversion: B87371708A8B5E1828E14CD
>>>> depends: configfs
>>>> supported: yes
>>>> vermagic: 2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>>
>>>> BEERGOGGLES1:~# /etc/init.d/o2cb status
>>>> Module "configfs": Loaded
>>>> Filesystem "configfs": Mounted
>>>> Module "ocfs2_nodemanager": Loaded
>>>> Module "ocfs2_dlm": Loaded
>>>> Module "ocfs2_dlmfs": Loaded
>>>> Filesystem "ocfs2_dlmfs": Mounted
>>>> Checking O2CB cluster ocfs2: Online
>>>> Heartbeat dead threshold = 31
>>>> Network idle timeout: 30000
>>>> Network keepalive delay: 2000
>>>> Network reconnect delay: 2000
>>>> Checking O2CB heartbeat: Active
>>>>
>>>> BEERGOGGLES1:~# mount | grep ocfs2
>>>> ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
>>>> /dev/evms/prod_app on /opt/VendorApsp/sharedapp type ocfs2 (rw,_netdev,heartbeat=local)
>>>>
>>>> BEERGOGGLES1:~# cat /etc/sysconfig/o2cb
>>>> #
>>>> # This is a configuration file for automatic startup of the O2CB
>>>> # driver. It is generated by running /etc/init.d/o2cb configure.
>>>> # On Debian based systems the preferred method is running
>>>> # 'dpkg-reconfigure ocfs2-tools'.
>>>> #
>>>>
>>>> # O2CB_ENABLED: 'true' means to load the driver on boot.
>>>> O2CB_ENABLED=true
>>>>
>>>> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
>>>> O2CB_BOOTCLUSTER=ocfs2
>>>>
>>>> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
>>>> O2CB_HEARTBEAT_THRESHOLD=
>>>>
>>>> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is
>>>> # considered dead.
>>>> O2CB_IDLE_TIMEOUT_MS=
>>>>
>>>> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is
>>>> # sent
>>>> O2CB_KEEPALIVE_DELAY_MS=
>>>>
>>>> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
>>>> O2CB_RECONNECT_DELAY_MS=
>>>>
>>>> # O2CB_HEARTBEAT_MODE: Whether to use the native "kernel" or the "user"
>>>> # driven heartbeat (for example, for integration with heartbeat 2.0.x)
>>>> O2CB_HEARTBEAT_MODE="kernel"
>>>>
>>>> _______________________________________________
>>>> Ocfs2-users mailing list
>>>> Ocfs2-users at oss.oracle.com
>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>
>>>
>>>
>>