Dmitry Isakbayev
2019-Jan-02 16:28 UTC
[Gluster-users] java application crashes while reading a zip file
Still no JVM crashes. Is it possible that running glusterfs with performance options turned off for a couple of days cleared out the "stale metadata issue"?

On Mon, Dec 31, 2018 at 1:38 PM Dmitry Isakbayev <isakdim at gmail.com> wrote:

> The software ran with all of the options turned off over the weekend without any problems.
> I will try to collect the debug info for you. I have re-enabled the three options, but have yet to see the problem reoccur.
>
> On Sat, Dec 29, 2018 at 6:46 PM Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>
>> Thanks Dmitry. Can you provide the following debug info I asked for earlier:
>>
>> * strace -ff -v ... of the java application
>> * dump of the I/O traffic seen by the mountpoint (use --dump-fuse while mounting).
>>
>> regards,
>> Raghavendra
>>
>> On Sat, Dec 29, 2018 at 2:08 AM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>>
>>> These 3 options seem to trigger both problems (reading the zip file and renaming files).
>>>
>>> Options Reconfigured:
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> performance.quick-read: off
>>> performance.parallel-readdir: off
>>> *performance.readdir-ahead: on*
>>> *performance.write-behind: on*
>>> *performance.read-ahead: on*
>>> performance.client-io-threads: off
>>> nfs.disable: on
>>> transport.address-family: inet
>>>
>>> On Fri, Dec 28, 2018 at 10:24 AM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>>>
>>>> Turning a single option on at a time still worked fine. I will keep trying.
>>>>
>>>> We had used 4.1.5 on KVM/CentOS 7.5 at AWS without these issues or log messages. Do you suppose these issues are triggered by the new environment, or did they simply not exist in 4.1.5?
>>>>
>>>> [root at node1 ~]# glusterfs --version
>>>> glusterfs 4.1.5
>>>>
>>>> On AWS, using:
>>>> [root at node1 ~]# hostnamectl
>>>> Static hostname: node1
>>>> Icon name: computer-vm
>>>> Chassis: vm
>>>> Machine ID: b30d0f2110ac3807b210c19ede3ce88f
>>>> Boot ID: 52bb159a0aa94043a40e7c7651967bd9
>>>> Virtualization: kvm
>>>> Operating System: CentOS Linux 7 (Core)
>>>> CPE OS Name: cpe:/o:centos:centos:7
>>>> Kernel: Linux 3.10.0-862.3.2.el7.x86_64
>>>> Architecture: x86-64
>>>>
>>>> On Fri, Dec 28, 2018 at 8:56 AM Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>>>>
>>>>> On Fri, Dec 28, 2018 at 7:23 PM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>>>>>
>>>>>> Ok. I will try different options.
>>>>>>
>>>>>> This system is scheduled to go into production soon. What version would you recommend rolling back to?
>>>>>
>>>>> These are long-standing issues, so rolling back may not make them go away. Instead, if the performance is agreeable to you, please keep these xlators off in production.
>>>>>
>>>>>> On Thu, Dec 27, 2018 at 10:55 PM Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>>>>>>
>>>>>>> On Fri, Dec 28, 2018 at 3:13 AM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>>>>>>>
>>>>>>>> Raghavendra,
>>>>>>>>
>>>>>>>> Thanks for the suggestion.
>>>>>>>>
>>>>>>>> I am using
>>>>>>>>
>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster --version
>>>>>>>> glusterfs 5.0
>>>>>>>>
>>>>>>>> On:
>>>>>>>> [root at jl-fanexoss1p glusterfs]# hostnamectl
>>>>>>>> Icon name: computer-vm
>>>>>>>> Chassis: vm
>>>>>>>> Machine ID: e44b8478ef7a467d98363614f4e50535
>>>>>>>> Boot ID: eed98992fdda4c88bdd459a89101766b
>>>>>>>> Virtualization: vmware
>>>>>>>> Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
>>>>>>>> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server
>>>>>>>> Kernel: Linux 3.10.0-862.14.4.el7.x86_64
>>>>>>>> Architecture: x86-64
>>>>>>>>
>>>>>>>> I have configured the following options:
>>>>>>>>
>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster volume info
>>>>>>>> Volume Name: gv0
>>>>>>>> Type: Replicate
>>>>>>>> Volume ID: 5ffbda09-c5e2-4abc-b89e-79b5d8a40824
>>>>>>>> Status: Started
>>>>>>>> Snapshot Count: 0
>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>> Transport-type: tcp
>>>>>>>> Bricks:
>>>>>>>> Brick1: jl-fanexoss1p.cspire.net:/data/brick1/gv0
>>>>>>>> Brick2: sl-fanexoss2p.cspire.net:/data/brick1/gv0
>>>>>>>> Brick3: nxquorum1p.cspire.net:/data/brick1/gv0
>>>>>>>> Options Reconfigured:
>>>>>>>> performance.io-cache: off
>>>>>>>> performance.stat-prefetch: off
>>>>>>>> performance.quick-read: off
>>>>>>>> performance.parallel-readdir: off
>>>>>>>> performance.readdir-ahead: off
>>>>>>>> performance.write-behind: off
>>>>>>>> performance.read-ahead: off
>>>>>>>> performance.client-io-threads: off
>>>>>>>> nfs.disable: on
>>>>>>>> transport.address-family: inet
>>>>>>>>
>>>>>>>> I don't know if it is related, but I am seeing a lot of
>>>>>>>> [2018-12-27 20:19:23.776080] W [MSGID: 114031] [client-rpc-fops_v2.c:1932:client4_0_seek_cbk] 2-gv0-client-0: remote operation failed [No such device or address]
>>>>>>>> [2018-12-27 20:19:47.735190] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler
>>>>>>>
>>>>>>> These msgs were introduced by patch [1]. To the best of my knowledge they are benign. We'll be sending a patch to fix these msgs though.
>>>>>>>
>>>>>>> +Mohit Agrawal <moagrawa at redhat.com> +Milind Changire <mchangir at redhat.com>. Can you try to identify why we are seeing these messages? If possible please send a patch to fix this.
>>>>>>>
>>>>>>> [1] https://review.gluster.org/r/I578c3fc67713f4234bd3abbec5d3fbba19059ea5
>>>>>>>
>>>>>>>> And java.io exceptions trying to rename files.
>>>>>>>
>>>>>>> When you see the errors, is it possible to collect:
>>>>>>> * strace of the java application (strace -ff -v ...)
>>>>>>> * fuse-dump of the glusterfs mount (use option --dump-fuse while mounting)?
>>>>>>>
>>>>>>> I also need another favour from you. By trial and error, can you point out which of the many performance xlators you've turned off is causing the issue?
>>>>>>>
>>>>>>> The above two data points will help us to fix the problem.
>>>>>>>
>>>>>>>> Thank You,
>>>>>>>> Dmitry
>>>>>>>>
>>>>>>>> On Thu, Dec 27, 2018 at 3:48 PM Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> What version of glusterfs are you using? It might be either
>>>>>>>>> * a stale metadata issue.
>>>>>>>>> * an inconsistent ctime issue.
>>>>>>>>>
>>>>>>>>> Can you try turning off all performance xlators? If the issue is the first one, that should help.
>>>>>>>>>
>>>>>>>>> On Fri, Dec 28, 2018 at 1:51 AM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Attempted to set `performance.read-ahead off` according to https://jira.apache.org/jira/browse/AMQ-7041
>>>>>>>>>> That did not help.
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 24, 2018 at 2:11 PM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The core file generated by the JVM suggests that it happens because the file is changing while it is being read - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8186557.
>>>>>>>>>>> The application reads in the zip file and goes through the zip entries, then reloads the file and goes through the zip entries again. It does so 3 times. The application never crashes on the 1st cycle but sometimes crashes on the 2nd or 3rd cycle.
>>>>>>>>>>> The zip file is generated about 20 seconds prior to it being used and is not updated or even used by any other application. I have never seen this problem on a plain file system.
>>>>>>>>>>>
>>>>>>>>>>> I would appreciate any suggestions on how to go about debugging this issue. I can change the source code of the java application.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Dmitry
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
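For reference, a minimal sketch of how the two pieces of debug data requested above (the strace of the java application and the FUSE traffic dump) could be collected. The server and volume name (jl-fanexoss1p.cspire.net, gv0) come from the volume info quoted in this thread; the mount point, output paths, and the java command line are placeholders, and the exact mount invocation may differ on your systems:

  # Sketch, not verified on this setup: mount the volume with a FUSE traffic
  # dump enabled, then run the application under strace.
  glusterfs --volfile-server=jl-fanexoss1p.cspire.net --volfile-id=gv0 \
            --dump-fuse=/var/tmp/gv0-fuse.dump /mnt/gv0

  # -ff follows forked children and writes one trace file per process/thread,
  # -v prints unabbreviated argument structures; traces land in /var/tmp/java-app.<pid>
  strace -ff -v -o /var/tmp/java-app java -jar /path/to/application.jar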
Raghavendra Gowdappa
2019-Jan-03 02:25 UTC
[Gluster-users] java application crashes while reading a zip file
On Wed, Jan 2, 2019 at 9:59 PM Dmitry Isakbayev <isakdim at gmail.com> wrote:

> Still no JVM crashes. Is it possible that running glusterfs with performance options turned off for a couple of days cleared out the "stale metadata issue"?

Restarting these options would have cleared the existing cache, and hence the previous stale metadata would have been cleared as well. Hitting stale metadata again depends on races; that might be the reason you are still not seeing the issue. Can you try with enabling all perf xlators (default configuration)?

> [...]
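To act on the request above to re-enable all perf xlators, one possible route is to reset the explicitly reconfigured options back to their defaults rather than setting each one to "on". This is a sketch only, assuming the volume name gv0 and the option list from the `gluster volume info` output earlier in the thread:

  # Sketch: return the performance options that were turned off to their
  # default values on volume gv0.
  for opt in performance.io-cache performance.stat-prefetch \
             performance.quick-read performance.parallel-readdir \
             performance.readdir-ahead performance.write-behind \
             performance.read-ahead performance.client-io-threads; do
      gluster volume reset gv0 "$opt"
  done

  # Confirm the options no longer appear under "Options Reconfigured":
  gluster volume info gv0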