On 03/26/2015 01:38 PM, Jonathan Heese wrote:
> Joe,
>
> Thanks again for the reply.
>
> Your theory makes sense to me, but I'm still not seeing a solution
> from here... Can you (or anyone else) help me to:
>
> 1. Determine why it's trying to connect to some server via RDMA (seems
> like my nfs-server.vol config might be an obvious choice, but I'm not
> sure), and what server,
RDMA is just something it tries. It's a red herring.
>
> 2. Determine why it's failing to connect thusly (was this part of the
> RDMA bug in 3.5.3?),
Again, red herring.
>
> 3. Correct the bit of configuration causing 1) and 2) above.
The question is: why can't the NFS service connect to all of the servers?
Check firewall, SELinux, iptables, allowed hosts... the usual suspects.
>
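A minimal first pass on each node might look like this (a sketch; 24007 is the standard glusterd management port, and the brick/NFS ports 49152 and 2049 and the 10.10.10.x peer addresses are taken from the volume status output further down this thread -- adjust to your environment):

    # Is anything filtering the gluster ports?
    iptables -L -n | grep -E '24007|49152|2049'
    # Is SELinux enforcing?
    getenforce
    # Can this node reach glusterd and the brick on the peer?
    telnet 10.10.10.2 24007
    telnet 10.10.10.2 49152
    # Do the peers and bricks look healthy?
    gluster peer status
    gluster volume status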
> 4. Explain if there are any (significant) pros or cons to using the
> RDMA transport or the TCP transport (assuming both function over a
> 20Gb InfiniBand connection).
RDMA is remote direct memory access. It allows the hardware to put the
packet directly into RAM instead of passing it through the kernel's TCP
stack. This saves several context switches, decreasing latency
significantly per fop (file operation).
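If it helps with the rdma-vs-tcp decision, a quick way to check whether the fabric is even RDMA-capable (a sketch; assumes the libibverbs/infiniband-diags utilities are installed, package names vary by distro):

    # List RDMA-capable devices and their port state
    ibv_devinfo
    # Show the InfiniBand HCA link status
    ibstat
    # Confirm which transport the volume is currently configured with
    gluster volume info gluster_disk | grep -i transport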
>
> Thanks again!
>
> Regards,
> Jon Heese
>
> On Mar 26, 2015, at 4:20 PM, "Joe Julian" <joe at julianfamily.org
> <mailto:joe at julianfamily.org>> wrote:
>
>> Every 3 seconds implies, to me, that it's trying to reconnect to a
>> server.
>>
>> On 03/26/2015 01:12 PM, Jonathan Heese wrote:
>>>
>>> Joe,
>>>
>>>
>>> Hmmm.... But every 3 seconds for all eternity? Seems a bit much for
>>> a "warning", doesn't it?
>>>
>>>
>>> Did you see my last reply? My nfs-server.vol file seems to indicate
>>> that RDMA is still in use in some capacity... Is this normal? If
>>> not, how can I reconcile this?
>>>
>>>
>>> Thanks.
>>>
>>>
>>> Regards,
>>>
>>> Jon Heese
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* gluster-users-bounces at gluster.org
>>> <gluster-users-bounces at gluster.org> on behalf of Joe Julian
>>> <joe at julianfamily.org>
>>> *Sent:* Thursday, March 26, 2015 4:08 PM
>>> *To:* gluster-users at gluster.org
>>> *Subject:* Re: [Gluster-users] I/O error on replicated volume
>>> The RDMA warnings are not relevant if you don't use RDMA. It's
>>> simply pointing out that it tried to register and it couldn't, which
>>> would be expected if your system doesn't support it.
>>>
>>> On 03/23/2015 12:29 AM, Mohammed Rafi K C wrote:
>>>>
>>>> On 03/23/2015 11:28 AM, Jonathan Heese wrote:
>>>>> On Mar 23, 2015, at 1:20 AM, "Mohammed Rafi K C"
>>>>> <rkavunga at redhat.com <mailto:rkavunga at redhat.com>> wrote:
>>>>>
>>>>>>
>>>>>> On 03/21/2015 07:49 PM, Jonathan Heese wrote:
>>>>>>>
>>>>>>> Mohammed,
>>>>>>>
>>>>>>>
>>>>>>> I have completed the steps you suggested (unmount all, stop the
>>>>>>> volume, set the config.transport to tcp, start the volume,
>>>>>>> mount, etc.), and the behavior has indeed changed.
>>>>>>>
>>>>>>>
>>>>>>> [root at duke ~]# gluster volume info
>>>>>>>
>>>>>>> Volume Name: gluster_disk
>>>>>>> Type: Replicate
>>>>>>> Volume ID: 2307a5a8-641e-44f4-8eaf-7cc2b704aafd
>>>>>>> Status: Started
>>>>>>> Number of Bricks: 1 x 2 = 2
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: duke-ib:/bricks/brick1
>>>>>>> Brick2: duchess-ib:/bricks/brick1
>>>>>>> Options Reconfigured:
>>>>>>> config.transport: tcp
>>>>>>>
>>>>>>>
>>>>>>> [root at duke ~]# gluster volume status
>>>>>>> Status of volume: gluster_disk
>>>>>>> Gluster process                          Port    Online  Pid
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> Brick duke-ib:/bricks/brick1             49152   Y       16362
>>>>>>> Brick duchess-ib:/bricks/brick1          49152   Y       14155
>>>>>>> NFS Server on localhost                  2049    Y       16374
>>>>>>> Self-heal Daemon on localhost            N/A     Y       16381
>>>>>>> NFS Server on duchess-ib                 2049    Y       14167
>>>>>>> Self-heal Daemon on duchess-ib           N/A     Y       14174
>>>>>>>
>>>>>>> Task Status of Volume gluster_disk
>>>>>>>
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> There are no active volume tasks
>>>>>>>
>>>>>>> I am no longer seeing the I/O errors during prolonged periods of
>>>>>>> write I/O that I was seeing when the transport was set to rdma.
>>>>>>> However, I am seeing this message on both nodes every 3 seconds
>>>>>>> (almost exactly):
>>>>>>>
>>>>>>>
>>>>>>> ==> /var/log/glusterfs/nfs.log <==
>>>>>>> [2015-03-21 14:17:40.379719] W
>>>>>>> [rdma.c:1076:gf_rdma_cm_event_handler] 0-gluster_disk-client-1:
>>>>>>> cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.10.10.1:1023
>>>>>>> peer:10.10.10.2:49152)
>>>>>>>
>>>>>>>
>>>>>>> Is this something to worry about?
>>>>>>>
>>>>>> If you are not using nfs to export the volumes, there is nothing
>>>>>> to worry about.
>>>>>
>>>>> I'm using the native glusterfs FUSE component to mount the volume
>>>>> locally on both servers -- I assume that you're referring to the
>>>>> standard NFS protocol stuff, which I'm not using here.
>>>>>
>>>>> Incidentally, I would like to keep my logs from filling up with
>>>>> junk if possible. Is there something I can do to get rid of these
>>>>> (useless?) error messages?
>>>>
>>>> If I understand correctly, you are getting this enormous log
>>>> message from the nfs log only, and all other logs and everything
>>>> are fine now, right? If that is the case, and you are not at all
>>>> using nfs for exporting the volume, as a workaround you can disable
>>>> nfs for your volume or cluster (gluster v set nfs.disable on). This
>>>> will turn off your gluster nfs server, and you will no longer get
>>>> those log messages.
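For the volume in this thread, that workaround would look roughly like this (a sketch; assumes the volume name gluster_disk and that nothing mounts the volume over NFS):

    gluster volume set gluster_disk nfs.disable on
    # the "NFS Server" lines should now disappear from the status output
    gluster volume status gluster_disk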
>>>>
>>>>
>>>>>>> Any idea why there are rdma pieces in play when I've set my
>>>>>>> transport to tcp?
>>>>>>>
>>>>>>
>>>>>> There should not be any piece of rdma. If possible, can you paste
>>>>>> the volfile for the nfs server? You can find the volfile in
>>>>>> /var/lib/glusterd/nfs/nfs-server.vol or
>>>>>> /usr/local/var/lib/glusterd/nfs/nfs-server.vol
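To grab that file for pasting, something like the following should work (which of the two paths exists depends on whether glusterfs came from packages or was built from source):

    cat /var/lib/glusterd/nfs/nfs-server.vol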
>>>>>
>>>>> I will get this for you when I can. Thanks.
>>>>
>>>> If you can make it, that will be a great help in understanding the
>>>> problem.
>>>>
>>>>
>>>> Rafi KC
>>>>
>>>>>
>>>>> Regards,
>>>>> Jon Heese
>>>>>
>>>>>> Rafi KC
>>>>>>>
>>>>>>> The actual I/O appears to be handled properly and I've seen no
>>>>>>> further errors in the testing I've done so far.
>>>>>>>
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Jon Heese
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>> *From:* gluster-users-bounces at gluster.org
>>>>>>> <gluster-users-bounces at gluster.org> on behalf of Jonathan Heese
>>>>>>> <jheese at inetu.net>
>>>>>>> *Sent:* Friday, March 20, 2015 7:04 AM
>>>>>>> *To:* Mohammed Rafi K C
>>>>>>> *Cc:* gluster-users
>>>>>>> *Subject:* Re: [Gluster-users] I/O error on replicated volume
>>>>>>> Mohammed,
>>>>>>>
>>>>>>> Thanks very much for the reply. I will try that and report back.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Jon Heese
>>>>>>>
>>>>>>> On Mar 20, 2015, at 3:26 AM, "Mohammed Rafi K C"
>>>>>>> <rkavunga at redhat.com <mailto:rkavunga at redhat.com>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On 03/19/2015 10:16 PM, Jonathan Heese wrote:
>>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> Does anyone else have any further suggestions for
>>>>>>>>> troubleshooting this?
>>>>>>>>>
>>>>>>>>> To sum up: I have a 2 node 2 brick replicated volume, which
>>>>>>>>> holds a handful of iSCSI image files which are mounted and
>>>>>>>>> served up by tgtd (CentOS 6) to a handful of devices on a
>>>>>>>>> dedicated iSCSI network. The most important iSCSI clients
>>>>>>>>> (initiators) are four VMware ESXi 5.5 hosts that use the iSCSI
>>>>>>>>> volumes as backing for their datastores for virtual machine
>>>>>>>>> storage.
>>>>>>>>>
>>>>>>>>> After a few minutes of sustained writing to the volume, I am
>>>>>>>>> seeing a massive flood (over 1500 per second at times) of this
>>>>>>>>> error in /var/log/glusterfs/mnt-gluster-disk.log:
>>>>>>>>>
>>>>>>>>> [2015-03-16 02:24:07.582801] W
>>>>>>>>> [fuse-bridge.c:2242:fuse_writev_cbk] 0-glusterfs-fuse: 635358:
>>>>>>>>> WRITE => -1 (Input/output error)
>>>>>>>>>
>>>>>>>>> When this happens, the ESXi box fails its write operation and
>>>>>>>>> returns an error to the effect of "Unable to write data to
>>>>>>>>> datastore". I don't see anything else in the supporting logs
>>>>>>>>> to explain the root cause of the i/o errors.
>>>>>>>>>
>>>>>>>>> Any and all suggestions are appreciated. Thanks.
>>>>>>>>>
>>>>>>>>
>>>>>>>> From the mount logs, I assume that your volume transport type
>>>>>>>> is rdma. There are some known issues for rdma in 3.5.3, and the
>>>>>>>> patches to address those issues have already been sent upstream
>>>>>>>> [1]. From the logs I'm not sure, and it is hard to tell you
>>>>>>>> whether this problem is related to the rdma transport or not.
>>>>>>>> To make sure that the tcp transport works well in this
>>>>>>>> scenario, if possible can you try to reproduce the same using
>>>>>>>> tcp-type volumes? You can change the transport type of the
>>>>>>>> volume with the following steps (not recommended in the normal
>>>>>>>> use case):
>>>>>>>>
>>>>>>>> 1) unmount every client
>>>>>>>> 2) stop the volume
>>>>>>>> 3) run gluster volume set volname config.transport tcp
>>>>>>>> 4) start the volume again
>>>>>>>> 5) mount the clients
>>>>>>>>
>>>>>>>> [1] : http://goo.gl/2PTL61
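Spelled out for the volume in this thread, that sequence would look roughly like this (a sketch; the mount point /mnt/gluster_disk is only a guess based on the mount log file name, and the umount/mount steps have to be run on every client):

    umount /mnt/gluster_disk
    gluster volume stop gluster_disk
    gluster volume set gluster_disk config.transport tcp
    gluster volume start gluster_disk
    mount -t glusterfs duke-ib:/gluster_disk /mnt/gluster_disk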
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Rafi KC
>>>>>>>>
>>>>>>>>> /Jon Heese/
>>>>>>>>> /Systems Engineer/
>>>>>>>>> *INetU Managed Hosting*
>>>>>>>>> P: 610.266.7441 x 261
>>>>>>>>> F: 610.266.7434
>>>>>>>>> www.inetu.net <https://www.inetu.net/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Jonathan Heese
>>>>>>>>> *Sent:* Tuesday, March 17, 2015 12:36 PM
>>>>>>>>> *To:* 'Ravishankar N'; gluster-users at gluster.org
>>>>>>>>> *Subject:* RE: [Gluster-users] I/O error on replicated volume
>>>>>>>>>
>>>>>>>>> Ravi,
>>>>>>>>>
>>>>>>>>> The last lines in the mount log before the massive vomit of
>>>>>>>>> I/O errors are from 22 minutes prior, and seem innocuous to me:
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:07.126340] E
>>>>>>>>> [client-handshake.c:1760:client_query_portmap_cbk]
>>>>>>>>> 0-gluster_disk-client-0: failed to get the port number for
>>>>>>>>> remote subvolume. Please run 'gluster volume status' on server
>>>>>>>>> to see if brick process is running.
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:07.126587] W [rdma.c:4273:gf_rdma_disconnect]
>>>>>>>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f) [0x7fd9c557bccf]
>>>>>>>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7fd9c557a995]
>>>>>>>>> (-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)
>>>>>>>>> [0x7fd9c0d8fb9a]))) 0-gluster_disk-client-0: disconnect called
>>>>>>>>> (peer:10.10.10.1:24008)
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:07.126687] E
>>>>>>>>> [client-handshake.c:1760:client_query_portmap_cbk]
>>>>>>>>> 0-gluster_disk-client-1: failed to get the port number for
>>>>>>>>> remote subvolume. Please run 'gluster volume status' on server
>>>>>>>>> to see if brick process is running.
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:07.126737] W [rdma.c:4273:gf_rdma_disconnect]
>>>>>>>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f) [0x7fd9c557bccf]
>>>>>>>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7fd9c557a995]
>>>>>>>>> (-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)
>>>>>>>>> [0x7fd9c0d8fb9a]))) 0-gluster_disk-client-1: disconnect called
>>>>>>>>> (peer:10.10.10.2:24008)
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.730165] I [rpc-clnt.c:1729:rpc_clnt_reconfig]
>>>>>>>>> 0-gluster_disk-client-0: changing port to 49152 (from 0)
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.730276] W [rdma.c:4273:gf_rdma_disconnect]
>>>>>>>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f) [0x7fd9c557bccf]
>>>>>>>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7fd9c557a995]
>>>>>>>>> (-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)
>>>>>>>>> [0x7fd9c0d8fb9a]))) 0-gluster_disk-client-0: disconnect called
>>>>>>>>> (peer:10.10.10.1:24008)
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.739500] I [rpc-clnt.c:1729:rpc_clnt_reconfig]
>>>>>>>>> 0-gluster_disk-client-1: changing port to 49152 (from 0)
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.739560] W [rdma.c:4273:gf_rdma_disconnect]
>>>>>>>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f) [0x7fd9c557bccf]
>>>>>>>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7fd9c557a995]
>>>>>>>>> (-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)
>>>>>>>>> [0x7fd9c0d8fb9a]))) 0-gluster_disk-client-1: disconnect called
>>>>>>>>> (peer:10.10.10.2:24008)
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.741883] I
>>>>>>>>> [client-handshake.c:1677:select_server_supported_programs]
>>>>>>>>> 0-gluster_disk-client-0: Using Program GlusterFS 3.3, Num
>>>>>>>>> (1298437), Version (330)
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.744524] I
>>>>>>>>> [client-handshake.c:1462:client_setvolume_cbk]
>>>>>>>>> 0-gluster_disk-client-0: Connected to 10.10.10.1:49152,
>>>>>>>>> attached to remote volume '/bricks/brick1'.
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.744537] I
>>>>>>>>> [client-handshake.c:1474:client_setvolume_cbk]
>>>>>>>>> 0-gluster_disk-client-0: Server and Client lk-version numbers
>>>>>>>>> are not same, reopening the fds
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.744566] I [afr-common.c:4267:afr_notify]
>>>>>>>>> 0-gluster_disk-replicate-0: Subvolume 'gluster_disk-client-0'
>>>>>>>>> came back up; going online.
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.744627] I
>>>>>>>>> [client-handshake.c:450:client_set_lk_version_cbk]
>>>>>>>>> 0-gluster_disk-client-0: Server lk version = 1
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.753037] I
>>>>>>>>> [client-handshake.c:1677:select_server_supported_programs]
>>>>>>>>> 0-gluster_disk-client-1: Using Program GlusterFS 3.3, Num
>>>>>>>>> (1298437), Version (330)
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.755657] I
>>>>>>>>> [client-handshake.c:1462:client_setvolume_cbk]
>>>>>>>>> 0-gluster_disk-client-1: Connected to 10.10.10.2:49152,
>>>>>>>>> attached to remote volume '/bricks/brick1'.
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.755676] I
>>>>>>>>> [client-handshake.c:1474:client_setvolume_cbk]
>>>>>>>>> 0-gluster_disk-client-1: Server and Client lk-version numbers
>>>>>>>>> are not same, reopening the fds
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.761945] I [fuse-bridge.c:5016:fuse_graph_setup]
>>>>>>>>> 0-fuse: switched to graph 0
>>>>>>>>>
>>>>>>>>> [2015-03-16 01:37:10.762144] I
>>>>>>>>> [client-handshake.c:450:client_set_lk_version_cbk]
>>>>>>>>> 0-gluster_disk-client-1: Server lk version = 1
>>>>>>>>>
>>>>>>>>> [*2015-03-16 01:37:10.762279*] I [fuse-bridge.c:3953:fuse_init]
>>>>>>>>> 0-glusterfs-fuse: FUSE inited with protocol versions:
>>>>>>>>> glusterfs 7.22 kernel 7.14
>>>>>>>>>
>>>>>>>>> [*2015-03-16 01:59:26.098670*] W
>>>>>>>>> [fuse-bridge.c:2242:fuse_writev_cbk] 0-glusterfs-fuse: 292084:
>>>>>>>>> WRITE => -1 (Input/output error)
>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> I've seen no indication of split-brain on any files at any
>>>>>>>>> point in this (ever since downgrading from 3.6.2 to 3.5.3,
>>>>>>>>> which is when this particular issue started):
>>>>>>>>>
>>>>>>>>> [root at duke gfapi-module-for-linux-target-driver-]# gluster v
>>>>>>>>> heal gluster_disk info
>>>>>>>>>
>>>>>>>>> Brick duke.jonheese.local:/bricks/brick1/
>>>>>>>>>
>>>>>>>>> Number of entries: 0
>>>>>>>>>
>>>>>>>>> Brick duchess.jonheese.local:/bricks/brick1/
>>>>>>>>>
>>>>>>>>> Number of entries: 0
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> /Jon Heese/
>>>>>>>>> /Systems Engineer/
>>>>>>>>> *INetU Managed Hosting*
>>>>>>>>> P: 610.266.7441 x 261
>>>>>>>>> F: 610.266.7434
>>>>>>>>> www.inetu.net <https://www.inetu.net/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Ravishankar N [mailto:ravishankar at redhat.com]
>>>>>>>>> *Sent:* Tuesday, March 17, 2015 12:35 AM
>>>>>>>>> *To:* Jonathan Heese; gluster-users at gluster.org
>>>>>>>>> <mailto:gluster-users at gluster.org>
>>>>>>>>> *Subject:* Re: [Gluster-users] I/O error on replicated volume
>>>>>>>>>
>>>>>>>>> On 03/17/2015 02:14 AM, Jonathan Heese wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> So I resolved my previous issue with split-brains and the
>>>>>>>>> lack of self-healing by dropping my installed glusterfs*
>>>>>>>>> packages from 3.6.2 to 3.5.3, but now I've picked up a new
>>>>>>>>> issue, which actually makes normal use of the volume
>>>>>>>>> practically impossible.
>>>>>>>>>
>>>>>>>>> A little background for those not already paying close
>>>>>>>>> attention:
>>>>>>>>> I have a 2 node 2 brick replicating volume whose purpose
>>>>>>>>> in life is to hold iSCSI target files, primarily for use
>>>>>>>>> to provide datastores to a VMware ESXi cluster. The plan
>>>>>>>>> is to put a handful of image files on the Gluster volume,
>>>>>>>>> mount them locally on both Gluster nodes, and run tgtd on
>>>>>>>>> both, pointed to the image files on the mounted gluster
>>>>>>>>> volume. Then the ESXi boxes will use multipath
>>>>>>>>> (active/passive) iSCSI to connect to the nodes, with
>>>>>>>>> automatic failover in case of planned or unplanned
>>>>>>>>> downtime of the Gluster nodes.
>>>>>>>>>
>>>>>>>>> In my most recent round of testing with 3.5.3, I'm seeing
>>>>>>>>> a massive failure to write data to the volume after about
>>>>>>>>> 5-10 minutes, so I've simplified the scenario a bit (to
>>>>>>>>> minimize the variables) to: both Gluster nodes up, only
>>>>>>>>> one node (duke) mounted and running tgtd, and just regular
>>>>>>>>> (single path) iSCSI from a single ESXi server.
>>>>>>>>>
>>>>>>>>> About 5-10 minutes into migrating a VM onto the test
>>>>>>>>> datastore, /var/log/messages on duke gets blasted with a
>>>>>>>>> ton of messages exactly like this:
>>>>>>>>>
>>>>>>>>> Mar 15 22:24:06 duke tgtd: bs_rdwr_request(180) io error
>>>>>>>>> 0x1781e00 2a -1 512 22971904, Input/output error
>>>>>>>>>
>>>>>>>>> And /var/log/glusterfs/mnt-gluster_disk.log gets blasted
>>>>>>>>> with a ton of messages exactly like this:
>>>>>>>>>
>>>>>>>>> [2015-03-16 02:24:07.572279] W
>>>>>>>>> [fuse-bridge.c:2242:fuse_writev_cbk] 0-glusterfs-fuse:
>>>>>>>>> 635299: WRITE => -1 (Input/output error)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Are there any messages in the mount log from AFR about
>>>>>>>>> split-brain just before the above line appears?
>>>>>>>>> Does `gluster v heal <VOLNAME> info` show any files?
>>>>>>>>> Performing I/O on files that are in split-brain fails with EIO.
>>>>>>>>>
>>>>>>>>> -Ravi
>>>>>>>>>
>>>>>>>>> And the write operation from VMware's side fails as soon
>>>>>>>>> as these messages start.
>>>>>>>>>
>>>>>>>>> I don't see any other errors (in the log files I know of)
>>>>>>>>> indicating the root cause of these i/o errors. I'm sure
>>>>>>>>> that this is not enough information to tell what's going
>>>>>>>>> on, but can anyone help me figure out what to look at next
>>>>>>>>> to figure this out?
>>>>>>>>>
>>>>>>>>> I've also considered using Dan Lambright's libgfapi
>>>>>>>>> gluster module for tgtd (or something similar) to avoid
>>>>>>>>> going through FUSE, but I'm not sure whether that would be
>>>>>>>>> irrelevant to this problem, since I'm not 100% sure if it
>>>>>>>>> lies in FUSE or elsewhere.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> /Jon Heese/
>>>>>>>>> /Systems Engineer/
>>>>>>>>> *INetU Managed Hosting*
>>>>>>>>> P: 610.266.7441 x 261
>>>>>>>>> F: 610.266.7434
>>>>>>>>> www.inetu.net <https://www.inetu.net/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>