Area de Sistemas
2015-Sep-14 08:20 UTC
[Ocfs2-users] Problem with OCFS2 disk on some moments (slow until stalls)
Hello everyone,

We have a problem in a 3-member OCFS2 cluster used to serve a web/PHP
application that accesses (reads and/or writes) files located on the OCFS2
volume.
The problem appears only sometimes (apparently during high-load periods).

SYMPTOMS:
- access to OCFS2 content becomes slower and slower until it stalls
  * an "ls" command that normally takes <=1s takes 30s, 40s, 1m,...
- the load average of the system grows to 150, 200 or even more
- high iowait values: 70-90%
  * but CPU usage is low
- a lot of messages like these appear in the syslog:
    (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
  or
    (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2
  and the more "worrying":
    kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
    kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kernel: httpd D c6fe5d74 0 3488 1616 0x00000080
    kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74 000041fd c6fe5d88 c0439b18
    kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0 c6fe5de8 c0b976c0 f75ac6c0
    kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d c6fe5de8 f8fd9a86 00000001
    kernel: Call Trace:
    kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
    kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
    kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]
    kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
    kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
    kernel: [<c0873105>] schedule+0x35/0x50
    kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
    ....

(UNACCEPTABLE) WORKAROUND:
  stop httpd (really slow)
  stop the ocfs2 service (really slow)
  start ocfs2 and httpd

MORE INFO:
- OS information:
    Oracle Linux 6.4 32bit
    4GB RAM
    uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28 09:55:10 PDT 2013 i686 i686 i386 GNU/Linux
  * anyway: we have another 5-node cluster with Oracle Linux 7.1 (so a
    64bit OS) serving a newer version of the same application, and the
    problems are similar, so it appears not to be an OS problem but a more
    specific OCFS2 problem (bug? some tuning? other?)

- standard configuration
  * if you want I can show the cluster.conf configuration, but it is the
    "expected configuration"

- standard configuration in o2cb:
    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Stack glue driver: Loaded
    Stack plugin "o2cb": Loaded
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster "MoodleOCFS2": Online
      Heartbeat dead threshold: 31
      Network idle timeout: 30000
      Network keepalive delay: 2000
      Network reconnect delay: 2000
      Heartbeat mode: Local
    Checking O2CB heartbeat: Active

- mount options: _netdev,rw,noatime
  * so the other options (commit, data, ...) have their default values

Any ideas/suggestions?

Regards.

--
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: sistemas at uva.es
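
The syslog messages quoted in this post carry a negative errno value in the "status" field. As a rough aid for triage, the sketch below counts such lines per OCFS2 function and errno name, so it is easier to see which error dominates during a stall. It is only an illustration: the regular expression is inferred from the two message shapes shown above, and the sample lines (timestamps, host name, PIDs) are hypothetical, not taken from a real log.

```python
import errno
import re
from collections import Counter

# Pattern inferred from messages like
#   (httpd,12345,1):ocfs2_rename:1474 ERROR: status = -13
# The fields inside the parentheses are matched but not interpreted.
OCFS2_ERROR = re.compile(
    r"\([^,]+,\d+,\d+\):(?P<func>ocfs2_\w+):\d+ ERROR: status = (?P<status>-?\d+)"
)


def tally_ocfs2_errors(lines):
    """Count OCFS2 error lines by (function name, errno name)."""
    counts = Counter()
    for line in lines:
        match = OCFS2_ERROR.search(line)
        if match is None:
            continue  # not an OCFS2 "ERROR: status" line
        code = abs(int(match.group("status")))
        # Fall back to the raw number if the platform has no name for it.
        name = errno.errorcode.get(code, "errno %d" % code)
        counts[(match.group("func"), name)] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "Sep 14 08:00:01 node1 kernel: (httpd,1111,0):ocfs2_rename:1474 ERROR: status = -13",
    "Sep 14 08:00:02 node1 kernel: (httpd,2222,1):ocfs2_unlink:951 ERROR: status = -2",
    "Sep 14 08:00:03 node1 kernel: (httpd,3333,0):ocfs2_rename:1474 ERROR: status = -13",
    "Sep 14 08:00:04 node1 kernel: INFO: task httpd:3488 blocked for more than 120 seconds.",
]

if __name__ == "__main__":
    for (func, name), count in tally_ocfs2_errors(SAMPLE).most_common():
        print(func, name, count)
```

To use it on a real system, the lines of the syslog file could be passed in place of SAMPLE, for example by iterating over an open file object.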
Tariq Saeed
2015-Sep-14 18:29 UTC
[Ocfs2-users] Problem with OCFS2 disk on some moments (slow until stalls)
On 09/14/2015 01:20 AM, Area de Sistemas wrote:
> Hello everyone,
>
> We have a problem in a 3-member OCFS2 cluster used to serve a web/PHP
> application that accesses (reads and/or writes) files located on the OCFS2
> volume.
> The problem appears only sometimes (apparently during high-load periods).
>
> SYMPTOMS:
> - access to OCFS2 content becomes slower and slower until it stalls
>   * an "ls" command that normally takes <=1s takes 30s, 40s, 1m,...
> - the load average of the system grows to 150, 200 or even more
>
> - high iowait values: 70-90%

This is a hint that the disk is under pressure. Run iostat (see the man
page) when this happens, producing a report every 3 seconds or so, and look
at the %util column:

    %util
        Percentage of CPU time during which I/O requests were issued to
        the device (bandwidth utilization for the device). Device
        saturation occurs when this value is close to 100%.

>   * but CPU usage is low
>
> - a lot of messages like these appear in the syslog:
>   (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13

EACCES (Permission denied). Find the filename and check its permissions
with ls -l.

> or
>   (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2

ENOENT (No such file or directory). All we can say is that this was an
attempt to delete a file that had already been deleted from its directory.
Going further requires some knowledge of the environment. Is there an
application log?

> and the more "worrying":
> kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> kernel: httpd D c6fe5d74 0 3488 1616 0x00000080
> kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74 000041fd
> c6fe5d88 c0439b18
> kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0 c6fe5de8
> c0b976c0 f75ac6c0
> kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d c6fe5de8
> f8fd9a86 00000001
> kernel: Call Trace:
> kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
> kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
> kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]
> kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
> kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
> kernel: [<c0873105>] schedule+0x35/0x50
> kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
> ....

The important part of the backtrace is cut off. Where is the rest of it?
The entries starting with "?" are junk. You can attach /var/log/messages to
give us a complete picture.

My guess is that the task blocks on the mutex for so long because the
thread holding the mutex is itself blocked on I/O. Run

    ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

and look for processes in 'D' state (uninterruptible sleep). These are
usually processes blocked on I/O.
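
The reply suggests listing processes with ps and picking out those in 'D' state. The sketch below shows one way to filter such a listing in Python. It assumes the four-column layout produced by "ps -e -o pid,stat,comm,wchan=..." and that command names contain no spaces (names with spaces would be split incorrectly). The sample listing is hypothetical, and the helper names are made up for this illustration.

```python
import subprocess


def blocked_processes(ps_output):
    """Return (pid, stat, comm, wchan) for every process whose state starts with 'D'."""
    blocked = []
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # blank or malformed line
        pid, stat, comm = fields[:3]
        wchan = fields[3].strip() if len(fields) == 4 else "-"
        if stat.startswith("D"):  # 'D' = uninterruptible sleep, usually I/O
            blocked.append((int(pid), stat, comm, wchan))
    return blocked


def live_ps_output():
    """Run ps with the columns used above and return its text output."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid,stat,comm,wchan=WIDE-WCHAN-COLUMN"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical listing, for illustration only.
SAMPLE = """\
  PID STAT COMMAND         WIDE-WCHAN-COLUMN
    1 Ss   init            poll_schedule_timeout
 3488 D    httpd           __mutex_lock_slowpath
 3490 S    httpd           inet_csk_accept
 3501 D+   ls              ocfs2_wait_for_mask
"""

if __name__ == "__main__":
    for entry in blocked_processes(SAMPLE):
        print(entry)
```

On an affected node, blocked_processes(live_ps_output()) would give the current list. Sampling it every few seconds during a stall should show whether the same wait channel keeps appearing.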