Hi David. Is the cluster still in this state? If so, can you grab a couple of
stack traces from the offending brick process on gfs01a with gstack? First make
sure with top (or similar) that it really is the brick process spinning your
CPUs, so we know the stack traces come from the offending process; that will
give us an idea of what it is chewing on (sketch below). Beyond that, please
take sosreports on the servers and open a BZ. It may also be a good idea to
roll back versions until we can get this sorted out; I don't know how long you
can leave the cluster in this state. Once you have a bugzilla open I'll try to
reproduce what you are seeing.
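
Something like this on gfs01a should get us what we need (3820 and 3825 are
the brick PIDs from your volume status below; double-check them locally before
tracing):

    # confirm the hot process is a brick (glusterfsd), not glusterd/glustershd
    top -b -n 1 | head -20
    # take a few stack traces of each gfs01a brick, a few seconds apart
    for i in 1 2 3; do gstack 3820 > /tmp/brick01a.gstack.$i; sleep 5; done
    for i in 1 2 3; do gstack 3825 > /tmp/brick02a.gstack.$i; sleep 5; done
    # generate a sosreport on each server to attach to the BZ
    sosreport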
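
Also, if you want to see which files those GFIDs in the gfs01b glustershd log
point at: every GFID has an entry under .glusterfs on the brick, and for a
regular file it is a hardlink, so something along these lines should turn up
the real path. I'm assuming homegfs-client-3 maps to the brick02b brick on
gfs01b, going by the brick order in your volume info:

    # on gfs01b: the .glusterfs entry for the first gfid in the log
    ls -l /data/brick02b/homegfs/.glusterfs/d2/71/d2714957-0c83-4ab2-8cfc-1931c8e9d0bf
    # if it is a regular file, resolve the real path via the hardlink
    find /data/brick02b/homegfs \
        -samefile /data/brick02b/homegfs/.glusterfs/d2/71/d2714957-0c83-4ab2-8cfc-1931c8e9d0bf \
        -not -path '*/.glusterfs/*'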
-b
----- Original Message -----
> From: "David Robinson" <david.robinson at corvidtec.com>
> To: gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Saturday, October 17, 2015 12:19:36 PM
> Subject: [Gluster-users] 3.6.6 issues
>
> I upgraded my storage server from 3.6.3 to 3.6.6 and am now having issues.
> My setup (4x2) is shown below. One of the bricks (gfs01a) has a very high
> cpu-load even though the load on the other three bricks (gfs01b, gfs02a,
> gfs02b) is almost zero. The FUSE-mounted partition is extremely slow and
> basically unusable since the upgrade. I am getting a lot of the messages
> shown below in the logs on gfs01a and gfs01b. Nothing out of the ordinary
> is showing up on the gfs02a/gfs02b bricks.
> Can someone help?
> [root@gfs01b glusterfs]# gluster volume info homegfs
>
> Volume Name: homegfs
> Type: Distributed-Replicate
> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> Status: Started
> Number of Bricks: 4 x 2 = 8
> Transport-type: tcp
> Bricks:
> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> Options Reconfigured:
> changelog.rollover-time: 15
> changelog.fsync-interval: 3
> changelog.changelog: on
> geo-replication.ignore-pid-check: on
> geo-replication.indexing: off
> storage.owner-gid: 100
> network.ping-timeout: 10
> server.allow-insecure: on
> performance.write-behind-window-size: 128MB
> performance.cache-size: 128MB
> performance.io-thread-count: 32
> server.manage-gids: on
> [root@gfs01a glusterfs]# tail -f cli.log
> [2015-10-17 16:05:44.299933] I [socket.c:2353:socket_event_handler]
> 0-transport: disconnecting now
> [2015-10-17 16:05:44.331233] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [2015-10-17 16:06:33.397631] I [socket.c:2353:socket_event_handler]
> 0-transport: disconnecting now
> [2015-10-17 16:06:33.432970] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [2015-10-17 16:11:22.441290] I [socket.c:2353:socket_event_handler]
> 0-transport: disconnecting now
> [2015-10-17 16:11:22.472227] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [2015-10-17 16:15:44.176391] I [socket.c:2353:socket_event_handler]
> 0-transport: disconnecting now
> [2015-10-17 16:15:44.205064] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [2015-10-17 16:16:33.366424] I [socket.c:2353:socket_event_handler]
> 0-transport: disconnecting now
> [2015-10-17 16:16:33.377160] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [root@gfs01a glusterfs]# tail etc-glusterfs-glusterd.vol.log
> [2015-10-17 15:56:33.177207] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume Source
> [2015-10-17 16:01:22.303635] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume Software
> [2015-10-17 16:05:44.320555] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume homegfs
> [2015-10-17 16:06:17.204783] W [rpcsvc.c:254:rpcsvc_program_actor]
> 0-rpc-service: RPC program not available (req 1298437 330)
> [2015-10-17 16:06:17.204811] E [rpcsvc.c:544:rpcsvc_check_and_reply_error]
> 0-rpcsvc: rpc actor failed to complete successfully
> [2015-10-17 16:06:33.408695] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume Source
> [2015-10-17 16:11:22.462374] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume Software
> [2015-10-17 16:12:30.608092] E [glusterd-op-sm.c:207:glusterd_get_txn_opinfo]
> 0-: Unable to get transaction opinfo for transaction ID :
> d143b66b-2ac9-4fd9-8635-fe1eed41d56b
> [2015-10-17 16:15:44.198292] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume homegfs
> [2015-10-17 16:16:33.368170] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume Source
> [root@gfs01b glusterfs]# tail -f glustershd.log
> [2015-10-17 16:11:45.996447] I
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do]
> 0-homegfs-replicate-1: performing metadata selfheal on
> 0a65d73a-a416-418e-92f0-5cec7d240433
> [2015-10-17 16:11:46.030947] I [afr-self-heal-common.c:476:afr_log_selfheal]
> 0-homegfs-replicate-1: Completed metadata selfheal on
> 0a65d73a-a416-418e-92f0-5cec7d240433. source=1 sinks=0
> [2015-10-17 16:11:46.031241] W [client-rpc-fops.c:2772:client3_3_lookup_cbk]
> 0-homegfs-client-3: remote operation failed: No such file or directory.
> Path: <gfid:d2714957-0c83-4ab2-8cfc-1931c8e9d0bf>
> (d2714957-0c83-4ab2-8cfc-1931c8e9d0bf)
> [2015-10-17 16:11:46.031633] W [client-rpc-fops.c:2772:client3_3_lookup_cbk]
> 0-homegfs-client-3: remote operation failed: No such file or directory.
> Path: <gfid:87c5f875-c3e7-4b14-807a-4e6d940750fc>
> (87c5f875-c3e7-4b14-807a-4e6d940750fc)
> [2015-10-17 16:11:47.043367] W [client-rpc-fops.c:2772:client3_3_lookup_cbk]
> 0-homegfs-client-3: remote operation failed: No such file or directory.
> Path: <gfid:d2714957-0c83-4ab2-8cfc-1931c8e9d0bf>
> (d2714957-0c83-4ab2-8cfc-1931c8e9d0bf)
> [2015-10-17 16:11:47.054199] W [client-rpc-fops.c:2772:client3_3_lookup_cbk]
> 0-homegfs-client-3: remote operation failed: No such file or directory.
> Path: <gfid:87c5f875-c3e7-4b14-807a-4e6d940750fc>
> (87c5f875-c3e7-4b14-807a-4e6d940750fc)
> [2015-10-17 16:12:48.001869] W [client-rpc-fops.c:2772:client3_3_lookup_cbk]
> 0-homegfs-client-3: remote operation failed: No such file or directory.
> Path: <gfid:d2714957-0c83-4ab2-8cfc-1931c8e9d0bf>
> (d2714957-0c83-4ab2-8cfc-1931c8e9d0bf)
> [2015-10-17 16:12:48.012671] W [client-rpc-fops.c:2772:client3_3_lookup_cbk]
> 0-homegfs-client-3: remote operation failed: No such file or directory.
> Path: <gfid:87c5f875-c3e7-4b14-807a-4e6d940750fc>
> (87c5f875-c3e7-4b14-807a-4e6d940750fc)
> [2015-10-17 16:13:49.011591] W [client-rpc-fops.c:2772:client3_3_lookup_cbk]
> 0-homegfs-client-3: remote operation failed: No such file or directory.
> Path: <gfid:d2714957-0c83-4ab2-8cfc-1931c8e9d0bf>
> (d2714957-0c83-4ab2-8cfc-1931c8e9d0bf)
> [2015-10-17 16:13:49.018600] W [client-rpc-fops.c:2772:client3_3_lookup_cbk]
> 0-homegfs-client-3: remote operation failed: No such file or directory.
> Path: <gfid:87c5f875-c3e7-4b14-807a-4e6d940750fc>
> (87c5f875-c3e7-4b14-807a-4e6d940750fc)
> [root@gfs01b glusterfs]# tail cli.log
> [2015-10-16 10:52:16.002922] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [2015-10-16 10:52:16.167432] I [socket.c:2353:socket_event_handler]
> 0-transport: disconnecting now
> [2015-10-16 10:52:18.248024] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [2015-10-17 16:12:30.607603] I [socket.c:2353:socket_event_handler]
> 0-transport: disconnecting now
> [2015-10-17 16:12:30.628810] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [2015-10-17 16:12:33.992818] I [socket.c:2353:socket_event_handler]
> 0-transport: disconnecting now
> [2015-10-17 16:12:33.998944] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [2015-10-17 16:12:38.604461] I [socket.c:2353:socket_event_handler]
> 0-transport: disconnecting now
> [2015-10-17 16:12:38.605532] I [cli-rpc-ops.c:588:gf_cli_get_volume_cbk]
> 0-cli: Received resp to get vol: 0
> [2015-10-17 16:12:38.605659] I [input.c:36:cli_batch] 0-: Exiting with: 0
> [root@gfs01b glusterfs]# tail etc-glusterfs-glusterd.vol.log
> [2015-10-16 14:29:56.495120] E [rpcsvc.c:617:rpcsvc_handle_rpc_call]
> 0-rpc-service: Request received from non-privileged port. Failing request
> [2015-10-16 14:29:59.369109] E [rpcsvc.c:617:rpcsvc_handle_rpc_call]
> 0-rpc-service: Request received from non-privileged port. Failing request
> [2015-10-16 14:29:59.512093] E [rpcsvc.c:617:rpcsvc_handle_rpc_call]
> 0-rpc-service: Request received from non-privileged port. Failing request
> [2015-10-16 14:30:02.383574] E [rpcsvc.c:617:rpcsvc_handle_rpc_call]
> 0-rpc-service: Request received from non-privileged port. Failing request
> [2015-10-16 14:30:02.529206] E [rpcsvc.c:617:rpcsvc_handle_rpc_call]
> 0-rpc-service: Request received from non-privileged port. Failing request
> [2015-10-16 16:01:20.389100] E [rpcsvc.c:617:rpcsvc_handle_rpc_call]
> 0-rpc-service: Request received from non-privileged port. Failing request
> [2015-10-17 16:12:30.611161] W
> [glusterd-op-sm.c:4066:glusterd_op_modify_op_ctx] 0-management: op_ctx
> modification failed
> [2015-10-17 16:12:30.612433] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume Software
> [2015-10-17 16:12:30.618444] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume Source
> [2015-10-17 16:12:30.624005] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume homegfs
> [2015-10-17 16:12:33.993869] I
> [glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume homegfs
> [2015-10-17 16:12:38.605389] I
> [glusterd-handler.c:1296:__glusterd_handle_cli_get_volume] 0-glusterd:
> Received get vol req
> [root@gfs01b glusterfs]# gluster volume status homegfs
> Status of volume: homegfs
> Gluster process                                         Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick gfsib01a.corvidtec.com:/data/brick01a/homegfs     49152   Y       3820
> Brick gfsib01b.corvidtec.com:/data/brick01b/homegfs     49152   Y       3808
> Brick gfsib01a.corvidtec.com:/data/brick02a/homegfs     49153   Y       3825
> Brick gfsib01b.corvidtec.com:/data/brick02b/homegfs     49153   Y       3813
> Brick gfsib02a.corvidtec.com:/data/brick01a/homegfs     49152   Y       3967
> Brick gfsib02b.corvidtec.com:/data/brick01b/homegfs     49152   Y       3952
> Brick gfsib02a.corvidtec.com:/data/brick02a/homegfs     49153   Y       3972
> Brick gfsib02b.corvidtec.com:/data/brick02b/homegfs     49153   Y       3957
> NFS Server on localhost                                 2049    Y       3822
> Self-heal Daemon on localhost                           N/A     Y       3827
> NFS Server on 10.200.70.1                               2049    Y       3834
> Self-heal Daemon on 10.200.70.1                         N/A     Y       3839
> NFS Server on gfsib02a.corvidtec.com                    2049    Y       3981
> Self-heal Daemon on gfsib02a.corvidtec.com              N/A     Y       3986
> NFS Server on gfsib02b.corvidtec.com                    2049    Y       3966
> Self-heal Daemon on gfsib02b.corvidtec.com              N/A     Y       3971
>
> Task Status of Volume homegfs
> ------------------------------------------------------------------------------
> Task   : Rebalance
> ID     : 58b6cc76-c29c-4695-93fe-c42b1112e171
> Status : completed
>
>
>
> =======================>
>
>
> David F. Robinson, Ph.D.
> President - Corvid Technologies
> 145 Overhill Drive
> Mooresville, NC 28117
> 704.799.6944 x101 [Office]
> 704.252.1310 [Cell]
> 704.799.7974 [Fax]
> david.robinson at corvidtec.com
> http://www.corvidtec.com
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users