thr3ads.net - Gluster users - [Gluster-users] gluster and LIO, fairly basic setup, having major issues [Oct 2016]

Vijay Bellur

2016-Oct-06 16:22 UTC

[Gluster-users] gluster and LIO, fairly basic setup, having major issues

Hi Mike,

Can you please share your gluster volume configuration?

Also do you notice anything in client logs on the node where fileio
backstore is configured?

Thanks,
Vijay

On Wed, Oct 5, 2016 at 8:56 PM, Michael Ciccarelli <mikecicc01 at gmail.com> wrote:
> So I have a fairly basic setup using glusterfs between 2 nodes. The nodes
> have 10 gig connections and the bricks reside on SSD LVM LUNs:
>
> Brick1: media1-be:/gluster/brick1/gluster_volume_0
> Brick2: media2-be:/gluster/brick1/gluster_volume_0
>
>
> On this volume I have a LIO iscsi target with 1 fileio backstore that's
> being shared out to vmware ESXi hosts. The volume is around 900 gig and the
> fileio store is around 850g:
>
> -rw-r--r-- 1 root root 912680550400 Oct  5 20:47 iscsi.disk.3
>
> I set the WWN to be the same so the ESXi hosts see the nodes as 2 paths to
> the same target. I believe this is what I want. The issue I'm seeing is
> that while the IO wait is low I'm seeing high CPU usage with only 3 VMs
> running on only 1 of the ESX servers:
>
> this is media2-be:
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  1474 root      20   0 1396620  37912   5980 S 135.0  0.1 157:01.84 glusterfsd
>  1469 root      20   0  747996  13724   5424 S   2.0  0.0   1:10.59 glusterfs
>
> And this morning it seemed like I had to restart the LIO service on
> media1-be as the VMware was seeing time-out issues. I'm seeing issues like
> this on the VMware ESX servers:
>
> 2016-10-06T00:51:41.100Z cpu0:32785)WARNING: ScsiDeviceIO: 1223: Device naa.600140501ce79002e724ebdb66a6756d performance has deteriorated. I/O latency increased from average value of 33420 microseconds to 732696 microseconds.
>
> Are there any special settings I need for gluster+LIO+VMware to work?
> Has anyone gotten this to work well enough that it is stable? What am I
> missing?
>
> thanks,
> Mike
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users

Michael Ciccarelli

2016-Oct-06 20:25 UTC

[Gluster-users] gluster and LIO, fairly basic setup, having major issues

This is the contents of the volume's info file. Is there another file you
would want to see for the config?
type=2
count=2
status=1
sub_count=2
stripe_count=1
replica_count=2
disperse_count=0
redundancy_count=0
version=3
transport-type=0
volume-id=98c258e6-ae9e-4407-8f25-7e3f7700e100
username=removed just cause
password=removed just cause
op-version=3
client-op-version=3
quota-version=0
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
diagnostics.count-fop-hits=on
diagnostics.latency-measurement=on
performance.readdir-ahead=on
brick-0=media1-be:-gluster-brick1-gluster_volume_0
brick-1=media2-be:-gluster-brick1-gluster_volume_0
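
For anyone reading along, here is a minimal Python sketch of how the key=value lines above could be turned into a quick summary (replica count, bricks, and the volume options that were set). It is not part of the gluster tooling: `parse_info` and `summarize` are hypothetical helper names, and treating every dotted key as a volume option is an assumption made for illustration. The sample text is a subset of the file pasted above.

```python
# Hypothetical helpers for reading a glusterd volume "info" file.
# Assumption: the file is plain key=value text, one setting per line.
INFO_TEXT = """\
type=2
count=2
status=1
sub_count=2
stripe_count=1
replica_count=2
disperse_count=0
redundancy_count=0
diagnostics.count-fop-hits=on
diagnostics.latency-measurement=on
performance.readdir-ahead=on
brick-0=media1-be:-gluster-brick1-gluster_volume_0
brick-1=media2-be:-gluster-brick1-gluster_volume_0
"""


def parse_info(text):
    """Return a dict of the key=value pairs, skipping blank or malformed lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key] = value
    return settings


def summarize(settings):
    """Pick out the fields most relevant to a replica-volume discussion."""
    brick_keys = [k for k in settings if k.startswith("brick-")]
    # Sort numerically so brick-10 would come after brick-9.
    brick_keys.sort(key=lambda k: int(k.split("-", 1)[1]))
    return {
        "replica_count": int(settings.get("replica_count", 0)),
        "brick_count": int(settings.get("count", 0)),
        "bricks": [settings[k] for k in brick_keys],
        # Assumption: keys containing a dot are user-set volume options.
        "options": {k: v for k, v in settings.items() if "." in k},
    }


if __name__ == "__main__":
    for name, value in summarize(parse_info(INFO_TEXT)).items():
        print(name, "=", value)
```

Read this way, the file shows a 2-brick, replica-2 volume where the only options set are the two diagnostics counters and performance.readdir-ahead.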

here are some log entries, etc-glusterfs-glusterd.vol.log:
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:323:glusterd_svc_common_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 39 times between [2016-10-06 20:10:14.963402] and [2016-10-06 20:12:11.979684]
[2016-10-06 20:12:14.980203] I [MSGID: 106006] [glusterd-svc-mgmt.c:323:glusterd_svc_common_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2016-10-06 20:13:50.993490] W [socket.c:596:__socket_rwv] 0-nfs: readv on /var/run/gluster/360710d59bc4799f8c8a6374936d2b1b.socket failed (Invalid argument)

I can provide any specific details you would like to see. Last night I
tried one more time, and it appeared to work OK with 1 VM running under
VMware, but as soon as I had 3 running the targets became unresponsive. I
believe the gluster volume is OK, but for whatever reason the iSCSI target
daemon seems to be having some issues...

here is from the messages file:
Oct  5 23:13:00 media2 kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Oct  5 23:13:00 media2 kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Oct  5 23:13:35 media2 kernel: iSCSI/iqn.1998-01.com.vmware:vmware4-0941d552: Unsupported SCSI Opcode 0x4d, sending CHECK_CONDITION.
Oct  5 23:13:35 media2 kernel: iSCSI/iqn.1998-01.com.vmware:vmware4-0941d552: Unsupported SCSI Opcode 0x4d, sending CHECK_CONDITION.

and here are some more VMware iscsi errors:
2016-10-06T20:22:11.496Z cpu2:32825)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x89 (0x412e808532c0, 32801) to dev "naa.6001405c0d86944f3d2468d80c7d1540" on
2016-10-06T20:22:11.635Z cpu2:32787)ScsiDeviceIO: 2338: Cmd(0x412e808532c0) 0x89, CmdSN 0x4f05 from world 32801 to dev "naa.6001405c0d86944f3d2468d80c7d1
2016-10-06T20:22:11.635Z cpu3:35532)Fil3: 15389: Max timeout retries exceeded for caller Fil3_FileIO (status 'Timeout')
2016-10-06T20:22:11.635Z cpu2:196414)HBX: 2832: Waiting for timed out [HB state abcdef02 offset 3928064 gen 25 stampUS 49571997650 uuid 57f5c142-45632d75
2016-10-06T20:22:11.635Z cpu3:35532)HBX: 2832: Waiting for timed out [HB state abcdef02 offset 3928064 gen 25 stampUS 49571997650 uuid 57f5c142-45632d75-
2016-10-06T20:22:11.635Z cpu0:32799)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e80848580, 32799) to dev "naa.6001405c0d86944f3d2468d80c7d1540" on
2016-10-06T20:22:11.635Z cpu0:32799)ScsiDeviceIO: 2325: Cmd(0x412e80848580) 0x28, CmdSN 0x4f06 from world 32799 to dev "naa.6001405c0d86944f3d2468d80c7d1
2016-10-06T20:22:11.773Z cpu0:32843)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e80848580, 32799) to dev "naa.6001405c0d86944f3d2468d80c7d1540" on
2016-10-06T20:22:11.916Z cpu0:35549)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e80848580, 32799) to dev "naa.6001405c0d86944f3d2468d80c7d1540" on
2016-10-06T20:22:12.000Z cpu2:33431)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x410987bf0800 network resource pool netsched.pools.persist.iscsi associa
2016-10-06T20:22:12.000Z cpu2:33431)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x410987bf0800 network tracker id 16 tracker.iSCSI.172.16.1.40 associated
2016-10-06T20:22:12.056Z cpu0:35549)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e80848580, 32799) to dev "naa.6001405c0d86944f3d2468d80c7d1540" on
2016-10-06T20:22:12.194Z cpu0:35549)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e80848580, 32799) to dev "naa.6001405c0d86944f3d2468d80c7d1540" on
2016-10-06T20:22:12.253Z cpu2:33431)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba38:CH:1 T:1 CN:0: iSCSI connection is being marked "ONLINE"
2016-10-06T20:22:12.253Z cpu2:33431)WARNING: iscsi_vmk: iscsivmk_StartConnection: Sess [ISID: 00023d000004 TARGET: iqn.2016-09.iscsi.gluster:shared TPGT:
2016-10-06T20:22:12.253Z cpu2:33431)WARNING: iscsi_vmk: iscsivmk_StartConnection: Conn [CID: 0 L: 172.16.1.53:49959 R: 172.16.1.40:3260]
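
To see at a glance which SCSI commands ESXi is complaining about, the NMP lines above can be tallied by opcode. The following is a rough Python sketch rather than any VMware or LIO tool: `tally_opcodes` is a hypothetical name, the opcode table lists only the standard SCSI names for the opcodes that appear in this thread, and the sample is three of the NMP lines copied from the log above (truncated exactly as ESXi logged them).

```python
import re
from collections import Counter

# Standard SCSI opcode names, limited to the opcodes seen in this thread.
# 0x4d comes from the kernel messages on the target, not from the NMP lines.
OPCODE_NAMES = {
    0x28: "READ(10)",
    0x4D: "LOG SENSE",
    0x89: "COMPARE AND WRITE",
}

# \s+ after "Cmd" also matches a newline, so mail-wrapped log text still parses.
NMP_CMD = re.compile(r"nmp_ThrottleLogForDevice:\d+: Cmd\s+(0x[0-9a-fA-F]+)")

SAMPLE = """\
2016-10-06T20:22:11.496Z cpu2:32825)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x89 (0x412e808532c0, 32801) to dev "naa.6001405c0d86944f3d2468d80c7d1540" on
2016-10-06T20:22:11.635Z cpu0:32799)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e80848580, 32799) to dev "naa.6001405c0d86944f3d2468d80c7d1540" on
2016-10-06T20:22:11.773Z cpu0:32843)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e80848580, 32799) to dev "naa.6001405c0d86944f3d2468d80c7d1540" on
"""


def tally_opcodes(log_text):
    """Count NMP throttle-log entries per SCSI opcode name."""
    counts = Counter()
    for match in NMP_CMD.finditer(log_text):
        opcode = int(match.group(1), 16)
        # Fall back to the hex value for opcodes not in the small table above.
        counts[OPCODE_NAMES.get(opcode, hex(opcode))] += 1
    return counts


if __name__ == "__main__":
    for name, count in tally_opcodes(SAMPLE).most_common():
        print(f"{count:3d}  {name}")
```

In the full excerpt the failing commands are mostly plain reads (0x28) plus one COMPARE AND WRITE (0x89), which as far as I know is the command ESXi uses for VMFS ATS locking and heartbeats. That would fit the HBX heartbeat timeouts logged at the same timestamp, but it is an inference from the opcodes, not something the logs state.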

Is it that the gluster overhead is just killing LIO/target?

thanks,
Mike

