thr3ads.net - Ocfs2 users - [Ocfs2-users] Problem with OCFS2 disk on some moments (slow until stalls) [Sep 2015]

Area de Sistemas

2015-Sep-14 08:20 UTC

[Ocfs2-users] Problem with OCFS2 disk on some moments (slow until stalls)

Hello everyone,

We have a problem in a 3-member OCFS2 cluster used to serve a web/PHP
application that accesses (reads and/or writes) files located in the OCFS2
volume.
The problem appears only sometimes (apparently during high-load periods).

SYMPTOMS:
- access to OCFS2 content becomes slower and slower until it stalls
     * an "ls" command that normally takes <=1s takes 30s, 40s, 1m,...
- load average of the system grows to 150, 200 or even more

- high iowait values: 70-90%
     * but CPU usage is low

- a lot of messages like these appear in the syslog:
     (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
   or
     (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2

   and the more "worrying":
      kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: httpd           D c6fe5d74     0  3488   1616 0x00000080
      kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74 000041fd c6fe5d88 c0439b18
      kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0 c6fe5de8 c0b976c0 f75ac6c0
      kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d c6fe5de8 f8fd9a86 00000001
      kernel: Call Trace:
      kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
      kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
      kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]
      kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
      kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
      kernel: [<c0873105>] schedule+0x35/0x50
      kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
      ....


(UNACCEPTABLE) WORKAROUND:
    stop httpd (really slow)
    stop ocfs2 service (really slow)
    start ocfs2 and httpd

MORE INFO:
- OS information:
     Oracle Linux 6.4 32bit
     4GB RAM
     uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28 09:55:10 PDT 2013 i686 i686 i386 GNU/Linux
     * anyway: we have another 5-node cluster with Oracle Linux 7.1 (so a
64-bit OS) serving a newer version of the same application, and the
problems are similar, so it appears to be not an OS problem but a more
specific OCFS2 problem (bug? some tuning? other?)

- standard configuration
     * if you want I can show the cluster.conf configuration, but it is the
"expected configuration"

- standard configuration in o2cb:
     Driver for "configfs": Loaded
     Filesystem "configfs": Mounted
     Stack glue driver: Loaded
     Stack plugin "o2cb": Loaded
     Driver for "ocfs2_dlmfs": Loaded
     Filesystem "ocfs2_dlmfs": Mounted
     Checking O2CB cluster "MoodleOCFS2": Online
       Heartbeat dead threshold: 31
       Network idle timeout: 30000
       Network keepalive delay: 2000
       Network reconnect delay: 2000
       Heartbeat mode: Local
     Checking O2CB heartbeat: Active

- mount options: _netdev,rw,noatime
     * so other options (commit, data, ...) have their default values


Any ideas or suggestions?

Regards.

-- 
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: sistemas at uva.es

Tariq Saeed

2015-Sep-14 18:29 UTC

[Ocfs2-users] Problem with OCFS2 disk on some moments (slow until stalls)

On 09/14/2015 01:20 AM, Area de Sistemas wrote:
> Hello everyone,
>
> We have a problem in a 3 member OCFS2 cluster used to serve an web/php 
> application that access (read and/or write) files located in the OCFS2 
> volume.
> The problem appears only some times (apparently during high load periods).
>
> SYMPTOMS:
> - access to OCFS2 content becomes more an more slow until stalls
>     * a "ls" command that normally takes <=1s takes 30s, 40s, 1m,...
> - load average of the system grows to 150, 200 or even more
>
> - high iowait values: 70-90%

This is a hint that the disk is under pressure. Run iostat (see the man
page) when this happens, producing a report every 3 seconds or so, and
look at the %util column:

    %util
        Percentage of CPU time during which I/O requests were issued
        to the device (bandwidth utilization for the device). Device
        saturation occurs when this value is close to 100%.

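If iostat is not at hand on a node, roughly the same %util figure can be derived from two readings of /proc/diskstats. The following is only a minimal sketch in Python, assuming a Linux kernel that exposes the usual per-device counters there; the function names and the 3-second default interval are illustrative choices, not part of iostat or OCFS2.

```python
import time


def parse_diskstats(text):
    """Map device name -> milliseconds spent doing I/O so far."""
    ticks = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        # Layout: major, minor, device name, then the per-device counters.
        # The 10th counter (index 12) is the time spent doing I/O, in ms.
        ticks[fields[2]] = int(fields[12])
    return ticks


def utilization(before, after, interval_s):
    """Percent of the interval each device was busy, similar to %util."""
    elapsed_ms = interval_s * 1000.0
    return {
        dev: min(100.0, 100.0 * (after[dev] - before[dev]) / elapsed_ms)
        for dev in after
        if dev in before
    }


def sample(interval_s=3, path="/proc/diskstats"):
    """Take two snapshots interval_s seconds apart and compare them."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return utilization(before, after, interval_s)
```

Calling sample() in a loop during a slow period and watching for the device that backs the OCFS2 volume staying close to 100% would point to storage saturation rather than to the filesystem itself.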
>    * but CPU usage is low
>
> - in the syslog appears a lot of messages like:
>     (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13

EACCES: Permission denied. Find the filename and check its permissions
with ls -l.

>   or
>     (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2

ENOENT: All we can say is that this was an attempt to delete a file from
a directory after the file had already been deleted. This requires some
knowledge of the environment. Is there an application log?

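The status values in these messages are negated errno codes, so they can be decoded mechanically when scanning a large syslog. The following is a small sketch in Python, assuming the message format shown above; the regular expression and the decode_status name are illustrative, and it assumes the logged status is a plain errno value.

```python
import errno
import os
import re

# Matches lines such as "(httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13"
STATUS_RE = re.compile(r":(ocfs2_\w+):\d+ ERROR: status = (-\d+)")


def decode_status(line):
    """Return (function, errno name, description), or None if no match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    code = -int(m.group(2))  # the kernel reports errors as negative errno
    name = errno.errorcode.get(code, "UNKNOWN")
    return m.group(1), name, os.strerror(code)
```

Counting the decoded results per function and errno name over a slow period shows whether the errors are a steady background (for example, the application racing to delete the same file from several nodes) or whether they increase together with the stalls.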
>   and the more "worrying":
>      kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
>      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>      kernel: httpd           D c6fe5d74     0  3488   1616 0x00000080
>      kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74 000041fd c6fe5d88 c0439b18
>      kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0 c6fe5de8 c0b976c0 f75ac6c0
>      kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d c6fe5de8 f8fd9a86 00000001
>      kernel: Call Trace:
>      kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
>      kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
>      kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]
>      kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
>      kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
>      kernel: [<c0873105>] schedule+0x35/0x50
>      kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
>      ....

The important part of the backtrace is cut off. Where is the rest of it?
The entries starting with "?" are junk. You can attach /v/l/messages to
give us a complete picture. My guess is that the task has been blocked on
the mutex for so long because the thread holding the mutex is itself
blocked on I/O. Run "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN" and
look for processes in the 'D' state (uninterruptible sleep). These are
usually processes blocked on I/O.

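The same information can also be collected from /proc, which is convenient for logging it periodically while the problem builds up. The following is a minimal sketch in Python, assuming a Linux /proc with per-process stat and wchan files; parse_stat and blocked_tasks are illustrative names, and only processes (not individual threads) are inspected.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc="/proc"):
    """List (pid, comm, wchan) for processes in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
            if state != "D":
                continue
            with open(os.path.join(proc, entry, "wchan")) as f:
                wchan = f.read().strip() or "-"
        except OSError:
            continue  # process exited or file not readable
        found.append((pid, comm, wchan))
    return found
```

If most of the httpd processes turn up in the list with the same wait channel, that channel name is a useful detail to include alongside the complete backtrace.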
> (UNACCEPTABLE) WORKAROUND:
>    stop httpd (really slow)
>    stop ocfs2 service (really slow)
>    start ocfs2 an httpd
>
> MORE INFO:
> - OS information:
>     Oracle Linux 6.4 32bit
>     4GB RAM
>     uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28 09:55:10 
> PDT 2013 i686 i686 i386 GNU/Linux
>     * anyway: we have another 5 nodes cluster with Oracle Linux 7.1 
> (so 64bit OS) serving a newer version of the same application and the 
> problems are similar, so it appears not to be a OS problem but a more 
> specific OCFS2 problem (bug? some tuning? other?)
>
> - standard configuration
>     * if you want I can show the cluster.conf configuration but is the 
> "expected configuration"
>
> - standard configuration in o2cb:
>     Driver for "configfs": Loaded
>     Filesystem "configfs": Mounted
>     Stack glue driver: Loaded
>     Stack plugin "o2cb": Loaded
>     Driver for "ocfs2_dlmfs": Loaded
>     Filesystem "ocfs2_dlmfs": Mounted
>     Checking O2CB cluster "MoodleOCFS2": Online
>       Heartbeat dead threshold: 31
>       Network idle timeout: 30000
>       Network keepalive delay: 2000
>       Network reconnect delay: 2000
>       Heartbeat mode: Local
>     Checking O2CB heartbeat: Active
>
> - mount options: _netdev,rw,noatime
>     * so other options (commit, data, ...) have their default values
>
>
> Any ideas/suggestion?
>
> Regards.
>
