thr3ads.net - Ocfs2 users - [Ocfs2-users] input / out error on some nodes [Oct 2015]

Eric Ren

2015-Oct-26 02:40 UTC

[Ocfs2-users] input / out error on some nodes

Hi,

On 10/22/15 21:00, gjprabu wrote:
> Hi Eric,
>
> Thanks for your reply. We are still facing the same issue. We found the
> dmesg logs below, but they are expected: we took node1 down and brought
> it back up ourselves, and that is what the logs show. Other than that,
> we did not find any error messages. We also have a problem while
> unmounting: the umount process goes into the "D" state, and fsck fails
> with "fsck.ocfs2: I/O error". If you need us to run any other commands,
> please let me know.

1. System logs across boots:
# journalctl --list-boots
If there is only one boot record, please see "man journald.conf" to
configure journald to keep system logs across boots.
Then you can use "journalctl -b xxx" to see the system log of any
specific boot.

I can't see exactly which steps lead to that error message. It would be
better to reproduce the problem starting from a clean state.

2. The umount issue may be caused by the cluster being in a bad state,
with communication between nodes hung.

3. Please run fsck.ocfs2 on the device instead of the mount point.

4. Did you set up CEPH RBD with an OCFS2 cluster that was in good
condition? It's better to test the cluster more and confirm that it is
healthy before working on it.
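
For points 1 and 3, something like the sketch below may help. It is only an illustration, not an official procedure: the helper names device_for_mountpoint and enable_persistent_journal are made up here, and it assumes a systemd-based node with GNU sed and the default /etc/systemd/journald.conf.

```shell
#!/bin/sh
# Sketch only. Helper names are hypothetical; paths are the usual defaults.

# Print the device backing a mount point, read from a mounts table
# (/proc/mounts by default; fields are: device mountpoint fstype ...).
device_for_mountpoint() {
    mp=${1%/}                      # tolerate a trailing slash
    [ -n "$mp" ] || mp=/
    awk -v mp="$mp" '
        $2 == mp { dev = $1 }      # the last matching entry wins
        END { if (dev != "") print dev; else exit 1 }
    ' "${2:-/proc/mounts}"
}

# Set Storage=persistent in the given journald.conf so that journald
# keeps logs across reboots (default: /etc/systemd/journald.conf).
enable_persistent_journal() {
    conf=${1:-/etc/systemd/journald.conf}
    if grep -q '^[#[:space:]]*Storage=' "$conf" 2>/dev/null; then
        # Rewrite an existing (possibly commented-out) Storage= line.
        sed -i 's/^[#[:space:]]*Storage=.*/Storage=persistent/' "$conf"
    else
        # No Storage= line yet: append one in a [Journal] section.
        printf '[Journal]\nStorage=persistent\n' >> "$conf"
    fi
}

# Possible use on a node, as root:
#   enable_persistent_journal && systemctl restart systemd-journald
#   journalctl --list-boots      # boots recorded so far
#   journalctl -k -b -1          # kernel messages from the previous boot
#   device_for_mountpoint /home/build/downloads/
```

The device has to be looked up while the volume is still mounted, since it is read from /proc/mounts. As far as I know, fsck.ocfs2 should then be given that device only after the volume has been unmounted on all nodes.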


Thanks,
Eric

> *ocfs2 version*
> debugfs.ocfs2 1.8.0
>
> *# cat /etc/sysconfig/o2cb*
> #
> # This is a configuration file for automatic startup of the O2CB
> # driver.  It is generated by running /etc/init.d/o2cb configure.
> # On Debian based systems the preferred method is running
> # 'dpkg-reconfigure ocfs2-tools'.
> #
>
> # O2CB_STACK: The name of the cluster stack backing O2CB.
> O2CB_STACK=o2cb
>
> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> O2CB_BOOTCLUSTER=ocfs2
>
> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> O2CB_HEARTBEAT_THRESHOLD=31
>
> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
> O2CB_IDLE_TIMEOUT_MS=30000
>
> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
> O2CB_KEEPALIVE_DELAY_MS=2000
>
> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
> O2CB_RECONNECT_DELAY_MS=2000
>
> *# fsck.ocfs2 -fy /home/build/downloads/*
> fsck.ocfs2 1.8.0
> fsck.ocfs2: I/O error on channel while opening "/zoho/build/downloads/"
>
> _*dmesg logs*_
>
> [ 4229.886284] o2dlm: Joining domain A895BC216BE641A8A7E20AA89D57E051 ( 5 ) 1 nodes
> [ 4251.437451] o2dlm: Node 3 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 3 5 ) 2 nodes
> [ 4267.836392] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 3 5 ) 3 nodes
> [ 4292.755589] o2dlm: Node 2 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 5 ) 4 nodes
> [ 4306.262165] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> [316476.505401] (kworker/u192:0,95923,0):dlm_do_assert_master:1717 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 1
> [316476.505470] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
> [316480.437231] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316480.442389] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
> [316480.442412] (kworker/u192:0,95923,20):dlm_begin_reco_handler:2765 A895BC216BE641A8A7E20AA89D57E051: dead_node previously set to 1, node 3 changing it to 1
> [316480.541237] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316480.541241] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316485.542733] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316485.542740] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316485.542742] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316490.544535] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316490.544538] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316490.544539] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316495.546356] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316495.546362] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316495.546364] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316500.548135] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316500.548139] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316500.548140] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316505.549947] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316505.549951] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316505.549952] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316510.551734] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316510.551739] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316510.551740] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316515.553543] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316515.553547] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316515.553548] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316520.555337] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316520.555341] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316520.555343] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316525.557131] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316525.557136] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316525.557153] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316530.558952] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316530.558955] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316530.558957] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [316535.560781] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> [316535.560789] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> [316535.560792] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> [319419.525609] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
>
>
>
> *ps -auxxxxx | grep umount*
> root     32083 21.8  0.0 125620  2828 pts/14   D+   19:37 0:18 umount /home/build/repository
> root     32196  0.0  0.0 112652  2264 pts/8    S+   19:38 0:00 grep --color=auto umount
>
>
> *cat /proc/32083/stack*
> [<ffffffff8132ad7d>] o2net_send_message_vec+0x71d/0xb00
> [<ffffffff81352148>] dlm_send_remote_unlock_request.isra.2+0x128/0x410
> [<ffffffff813527db>] dlmunlock_common+0x3ab/0x9e0
> [<ffffffff81353088>] dlmunlock+0x278/0x800
> [<ffffffff8131f765>] o2cb_dlm_unlock+0x35/0x50
> [<ffffffff8131ecfe>] ocfs2_dlm_unlock+0x1e/0x30
> [<ffffffff812a8776>] ocfs2_drop_lock.isra.29.part.30+0x1f6/0x700
> [<ffffffff812ae40d>] ocfs2_simple_drop_lockres+0x2d/0x40
> [<ffffffff8129b43c>] ocfs2_dentry_lock_put+0x5c/0x80
> [<ffffffff8129b4a2>] ocfs2_dentry_iput+0x42/0x1d0
> [<ffffffff81204dc2>] __dentry_kill+0x102/0x1f0
> [<ffffffff81205294>] shrink_dentry_list+0xe4/0x2a0
> [<ffffffff81205aa8>] shrink_dcache_parent+0x38/0x90
> [<ffffffff81205b16>] do_one_tree+0x16/0x50
> [<ffffffff81206e9f>] shrink_dcache_for_umount+0x2f/0x90
> [<ffffffff811efb15>] generic_shutdown_super+0x25/0x100
> [<ffffffff811eff57>] kill_block_super+0x27/0x70
> [<ffffffff811f02a9>] deactivate_locked_super+0x49/0x60
> [<ffffffff811f089e>] deactivate_super+0x4e/0x70
> [<ffffffff8120da83>] cleanup_mnt+0x43/0x90
> [<ffffffff8120db22>] __cleanup_mnt+0x12/0x20
> [<ffffffff81093ba4>] task_work_run+0xc4/0xe0
> [<ffffffff81013c67>] do_notify_resume+0x97/0xb0
> [<ffffffff817d2ee7>] int_signal+0x12/0x17
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Regards
> Prabu
>
>
>
>
> ---- On Wed, 21 Oct 2015 08:32:15 +0530 *Eric Ren <zren at suse.com>*
> wrote ----
>
>     Hi Prabu,
>
>     I guess others, like me, are not familiar with this setup, which
>     combines CEPH RBD and OCFS2.
>
>     We'd really like to help you, but I think the ocfs2 developers
>     cannot get any information about what happened to ocfs2 from your
>     description.
>
>     So I'm wondering whether you can reproduce it and tell us the
>     steps. Once the developers can reproduce it, it's likely to be
>     resolved ;-) BTW, any dmesg log about ocfs2, especially the initial
>     error message and the stack backtrace, will be helpful!
>
>     Thanks,
>     Eric
>
>     On 10/20/15 17:29, gjprabu wrote:
>
>         Hi
>
>                 We are looking forward to your input on this.
>
>         Regards
>         Prabu
>
>         ---- On Fri, 09 Oct 2015 12:08:19 +0530 *gjprabu
>         <gjprabu at zohocorp.com>* wrote ----
>
>
>
>
>
>                 Hi All,
>
>                          Anybody pls help me on this issue.
>
>                 Regards
>                 Prabu
>
>
>
>
>                 ---- On Thu, 08 Oct 2015 12:33:57 +0530 *gjprabu
>                 <gjprabu at zohocorp.com>* wrote ----
>
>
>
>                     Hi All,
>
>                            We have servers with OCFS2 mounted on CEPH
>                     RBD. We are facing I/O errors simultaneously while
>                     moving data within the same disk (copying does not
>                     cause any problem). As a temporary workaround we
>                     remount the partition, which resolves the issue,
>                     but after some time the problem reappears. If
>                     anybody has faced the same issue, please help us.
>
>                     Note: We have 5 nodes in total. Two nodes are
>                     working fine; the other nodes show input/output
>                     errors like the ones below.
>
>                     ls -althr
>                     ls: cannot access LITE_3_0_M4_1_TEST: Input/output error
>                     ls: cannot access LITE_3_0_M4_1_OLD: Input/output error
>                     total 0
>                     d????????? ? ? ? ? ? LITE_3_0_M4_1_TEST
>                     d????????? ? ? ? ? ? LITE_3_0_M4_1_OLD
>
>                     cluster:
>                            node_count=5
>                            heartbeat_mode = local
>                            name=ocfs2
>
>                     node:
>                             ip_port = 7777
>                             ip_address = 192.168.113.42
>                             number = 1
>                             name = integ-hm9
>                             cluster = ocfs2
>
>                     node:
>                             ip_port = 7777
>                             ip_address = 192.168.112.115
>                             number = 2
>                             name = integ-hm2
>                             cluster = ocfs2
>
>                     node:
>                             ip_port = 7777
>                             ip_address = 192.168.113.43
>                             number = 3
>                             name = integ-ci-1
>                             cluster = ocfs2
>                     node:
>                             ip_port = 7777
>                             ip_address = 192.168.112.217
>                             number = 4
>                             name = integ-hm8
>                             cluster = ocfs2
>                     node:
>                             ip_port = 7777
>                             ip_address = 192.168.112.192
>                             number = 5
>                             name = integ-hm5
>                             cluster = ocfs2
>
>
>                     Regards
>                     Prabu
>
>
>
>                     _______________________________________________
>                     Ocfs2-users mailing list
>                     Ocfs2-users at oss.oracle.com
>                     <mailto:Ocfs2-users at oss.oracle.com>
>                     https://oss.oracle.com/mailman/listinfo/ocfs2-users
>
>
>
>
>
>

gjprabu

2015-Oct-26 08:28 UTC

[Ocfs2-users] input / out error on some nodes

Hi Eric,
We identified the issue. When we access the same directory simultaneously
from different nodes, we get an I/O error. Normally a cluster filesystem
handles this, but in our case it is not working. The ocfs2 version is
ocfs2-tools-1.8.0-16.

Example

Node1 : cd /home/downloads/test

Node2 : mv /home/downloads/test  /home/downloads/test1

Node1

ls -al /home/downloads/

d?????????   ? ?     ?         ?            ?   test1

Node2

ls -al /home/downloads/

drwxr-xr-x    2 root  root  3.9K Oct 26 12:06 test1
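
To check each node for this state after the rename, a small filter over the ls output could be used. This is only a rough sketch: broken_entries is a made-up helper name, and it assumes entry names without spaces, since it prints only the last field of each line.

```shell
#!/bin/sh
# Sketch only. broken_entries is a hypothetical helper name.

# Read "ls -l" style output on stdin and print the names of entries whose
# metadata could not be read: ls prints "?" for every permission bit.
broken_entries() {
    awk 'substr($1, 2) == "?????????" { print $NF }'
}

# Possible use, following the example above:
#   Node1: cd /home/downloads/test
#   Node2: mv /home/downloads/test /home/downloads/test1
#   then on each node:
#   ls -al /home/downloads/ 2>/dev/null | broken_entries
```

A node that prints nothing sees valid metadata for every entry. A node in the failing state would print test1.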





Regards

Prabu







