Atin Mukherjee
2018-Aug-05 13:39 UTC
[Gluster-users] Gluster High CPU/Clients Hanging on Heavy Writes
On Sun, 5 Aug 2018 at 13:29, Yuhao Zhang <zzyzxd at gmail.com> wrote:

> Sorry, what I meant was, if I start the transfer now and get glusterd
> into zombie status,

glusterd or glusterfsd?

> it's unlikely that I can fully recover the server without a reboot.
>
>> On Aug 5, 2018, at 02:55, Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>>
>> On Sun, Aug 5, 2018 at 1:22 PM, Yuhao Zhang <zzyzxd at gmail.com> wrote:
>>
>>> This is a semi-production server and I can't bring it down right now.
>>> I will try to get the monitoring output when I get a chance.
>>
>> Collecting top output doesn't require bringing down the servers.
>>
>>> As I recall, the high-CPU processes were the brick daemons (glusterfsd),
>>> and htop showed them in status D. However, I saw zero zpool IO, since
>>> the clients were all hanging.
>>>
>>>> On Aug 5, 2018, at 02:38, Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>>>>
>>>> On Sun, Aug 5, 2018 at 12:44 PM, Yuhao Zhang <zzyzxd at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am running into a situation where heavy writes drive the Gluster
>>>>> server into a zombie state with many high-CPU processes, and all
>>>>> clients hang. It is almost 100% reproducible on my machine. I hope
>>>>> someone can help.
>>>>
>>>> Can you give us the output of monitoring these high-CPU processes,
>>>> captured while your tests are running?
>>>>
>>>>     MON_INTERVAL=10  # can be increased for very long runs
>>>>     top -bd $MON_INTERVAL > /tmp/top_proc.${HOSTNAME}.txt   # CPU utilization by process
>>>>     top -bHd $MON_INTERVAL > /tmp/top_thr.${HOSTNAME}.txt   # CPU utilization by thread
>>>>
>>>>> I started to observe this issue when running rsync to copy files from
>>>>> another server, and I thought it might be because Gluster doesn't like
>>>>> rsync's delta transfer with its many small writes. However, I was able
>>>>> to reproduce this with "rsync --whole-file --inplace", and even with cp
>>>>> or scp. It usually appears a few hours after the transfer starts, but
>>>>> sometimes it happens within several minutes.
>>>>>
>>>>> Since this is a single-node Gluster distributed volume, I tried to
>>>>> transfer files directly onto the server, bypassing the Gluster clients,
>>>>> but that still caused the same issue.
>>>>>
>>>>> The volume runs on top of a ZFS RAIDZ2 dataset. I attached the
>>>>> statedump generated when my clients hung, along with the volume
>>>>> options.
>>>>>
>>>>> - Ubuntu 16.04 x86_64 / 4.4.0-116-generic
>>>>> - GlusterFS 3.12.8
>>>>>
>>>>> Thank you,
>>>>> Yuhao

--
--Atin
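A minimal sketch of how the suggested capture could be scripted on the server while reproducing the hang. The volume name 'myvol' and the interactive prompt are assumptions, not from the thread:

    #!/bin/bash
    # Sketch: run the suggested top monitors in the background, then take
    # a statedump of the brick processes once the hang reproduces.
    VOLNAME=myvol        # assumption: substitute the real volume name
    MON_INTERVAL=10      # can be increased for very long runs
    top -bd "$MON_INTERVAL" > "/tmp/top_proc.${HOSTNAME}.txt" &   # CPU by process
    TOP_PROC=$!
    top -bHd "$MON_INTERVAL" > "/tmp/top_thr.${HOSTNAME}.txt" &   # CPU by thread
    TOP_THR=$!
    read -rp "Press Enter once the bricks go to high CPU / D state... "
    # Dump brick internal state; the dump files land in /var/run/gluster
    # by default.
    gluster volume statedump "$VOLNAME"
    kill "$TOP_PROC" "$TOP_THR"   # stop the background top monitors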
Yuhao Zhang
2018-Aug-06 04:56 UTC
[Gluster-users] Gluster High CPU/Clients Hanging on Heavy Writes
Atin, that was my typo... I think it was glusterfsd, but I'm not 100% sure. I will keep an eye on it when it happens next time. Thank you all for looking into this!

I tried another transfer earlier today, but it didn't get the chance to reach the point where glusterfsd starts to fail before we needed to start the mission-critical production jobs. I am going to try again next week and report back to the thread.

Btw, I took a look at syslog and found this event happened right at the time the clients started to hang:

Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142702] INFO: task glusteriotwr3:6895 blocked for more than 120 seconds.
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142760] Tainted: P O 4.4.0-116-generic #140-Ubuntu
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142863] glusteriotwr3 D ffff88102c18fb98 0 6895 1 0x00000000
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142872] ffff88102c18fb98 ffff88102c18fb68 ffff88085be0b800 ffff88103f314600
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142877] ffff88102c190000 ffff8805397fe0f4 ffff88103f314600 00000000ffffffff
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142882] ffff8805397fe0f8 ffff88102c18fbb0 ffffffff8184ae45 ffff8805397fe0f0
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142887] Call Trace:
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142899] [<ffffffff8184ae45>] schedule+0x35/0x80
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142904] [<ffffffff8184b0ee>] schedule_preempt_disabled+0xe/0x10
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142910] [<ffffffff8184cd29>] __mutex_lock_slowpath+0xb9/0x130
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142915] [<ffffffff8184cdbf>] mutex_lock+0x1f/0x30
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142922] [<ffffffff8122071d>] walk_component+0x21d/0x310
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142926] [<ffffffff81221ba1>] link_path_walk+0x191/0x5c0
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142931] [<ffffffff8121ff3b>] ? path_init+0x1eb/0x3c0
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142935] [<ffffffff812220cc>] path_lookupat+0x7c/0x110
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142940] [<ffffffff8123a828>] ? __vfs_setxattr_noperm+0x128/0x1a0
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142944] [<ffffffff81223d01>] filename_lookup+0xb1/0x180
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142948] [<ffffffff8123aadb>] ? setxattr+0x18b/0x200
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142955] [<ffffffff811f2109>] ? kmem_cache_alloc+0x189/0x1f0
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142960] [<ffffffff81223906>] ? getname_flags+0x56/0x1f0
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142964] [<ffffffff81223ea6>] user_path_at_empty+0x36/0x40
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142970] [<ffffffff81218d76>] vfs_fstatat+0x66/0xc0
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142975] [<ffffffff81219331>] SYSC_newlstat+0x31/0x60
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142980] [<ffffffff8121dade>] ? path_put+0x1e/0x30
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142984] [<ffffffff8123ac0f>] ? path_setxattr+0xbf/0xe0
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142989] [<ffffffff8121946e>] SyS_newlstat+0xe/0x10
Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142996] [<ffffffff8184efc8>] entry_SYSCALL_64_fastpath+0x1c/0xbb

> On Aug 5, 2018, at 08:39, Atin Mukherjee <atin.mukherjee83 at gmail.com> wrote:
>
> On Sun, 5 Aug 2018 at 13:29, Yuhao Zhang <zzyzxd at gmail.com> wrote:
>
>> Sorry, what I meant was, if I start the transfer now and get glusterd
>> into zombie status,
>
> glusterd or glusterfsd?
>
>> it's unlikely that I can fully recover the server without a reboot.
>
> [...]
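The hung-task report above shows a brick I/O thread (glusteriotwr3) blocked in uninterruptible sleep on a mutex during a path lookup, which is consistent with the D-state glusterfsd processes reported earlier in the thread. A minimal sketch of what could be collected as root the next time it happens, before rebooting; the PID is taken from the trace above and will differ per run:

    # Kernel stack of the blocked thread, straight from /proc
    cat /proc/6895/stack
    # Ask the kernel to log stacks of all blocked (D state) tasks
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200
    # List every gluster thread currently in uninterruptible sleep
    ps -eLo pid,tid,stat,wchan:32,comm | grep -i gluster | awk '$3 ~ /^D/'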