I have set up a replicated, four-node Gluster config for a web farm. The idea is that each web node is its own Gluster server and has its own copy of the entire web root locally. It then serves the volume to itself via a mount. We're running it over bonded dual GigE NICs.

The problem I am having is that when we switch live traffic to nodes in the cluster, they almost immediately get out of sync. The issue seems to be with cache files that are read and written a lot. Here is an excerpt pointing to issues with our OpenX banner cache:

[2012-02-25 18:53:04.198326] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal failed on /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php
[2012-02-25 18:53:04.199191] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid differs on subvolume 0 (53fa373a-3830-4c5e-aa22-6ed35c947d97, c12e0cdd-9b6c-4988-b793-819db0472780)
[2012-02-25 18:53:04.199210] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid differs on subvolume 0 (53fa373a-3830-4c5e-aa22-6ed35c947d97, c12e0cdd-9b6c-4988-b793-819db0472780)
[2012-02-25 18:53:04.199219] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid different on subvolume
[2012-02-25 18:53:04.199236] I [afr-common.c:1038:afr_launch_self_heal] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal triggered.
path: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php
[2012-02-25 18:53:04.200752] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid differs on subvolume 0 (53fa373a-3830-4c5e-aa22-6ed35c947d97, c12e0cdd-9b6c-4988-b793-819db0472780)
[2012-02-25 18:53:04.200971] I [afr-self-heal-common.c:963:afr_sh_missing_entries_done] 0-web-pub-replicate-0: split brain found, aborting selfheal of /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php
[2012-02-25 18:53:04.200986] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal failed on /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php
[2012-02-25 18:53:04.202159] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 1 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.202178] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 1 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.202188] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid different on subvolume
[2012-02-25 18:53:04.202204] I [afr-common.c:1038:afr_launch_self_heal] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal triggered.
path: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.203463] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.203678] I [afr-self-heal-common.c:963:afr_sh_missing_entries_done] 0-web-pub-replicate-0: split brain found, aborting selfheal of /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.203693] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal failed on /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.204759] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.204781] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.204800] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid different on subvolume
[2012-02-25 18:53:04.204818] I [afr-common.c:1038:afr_launch_self_heal] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal triggered.
path: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.206150] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.206384] I [afr-self-heal-common.c:963:afr_sh_missing_entries_done] 0-web-pub-replicate-0: split brain found, aborting selfheal of /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.206400] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal failed on /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.207725] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.207746] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.207756] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid different on subvolume
[2012-02-25 18:53:04.207772] I [afr-common.c:1038:afr_launch_self_heal] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal triggered.
path: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.209217] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)

Nodes and network are fine. I have tried mounting the volumes using both the Gluster native client and the Gluster NFS client, but get the same results. It's killing performance.

Here is the config:

volume web-pub-client-0
    type protocol/client
    option remote-host web-web1
    option remote-subvolume /glusterfs/pub
    option transport-type tcp
end-volume

volume web-pub-client-1
    type protocol/client
    option remote-host web-web2
    option remote-subvolume /glusterfs/pub
    option transport-type tcp
end-volume

volume web-pub-client-2
    type protocol/client
    option remote-host web-web3
    option remote-subvolume /glusterfs/pub
    option transport-type tcp
end-volume

volume web-pub-client-3
    type protocol/client
    option remote-host web-web4
    option remote-subvolume /glusterfs/pub
    option transport-type tcp
end-volume

volume web-pub-replicate-0
    type cluster/replicate
    subvolumes web-pub-client-0 web-pub-client-1 web-pub-client-2 web-pub-client-3
end-volume

volume web-pub-write-behind
    type performance/write-behind
    subvolumes web-pub-replicate-0
end-volume

volume web-pub-read-ahead
    type performance/read-ahead
    subvolumes web-pub-write-behind
end-volume

volume web-pub-io-cache
    type performance/io-cache
    option cache-size 256MB
    subvolumes web-pub-read-ahead
end-volume

volume web-pub-quick-read
    type performance/quick-read
    option cache-size 256MB
    subvolumes web-pub-io-cache
end-volume

volume web-pub
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes web-pub-quick-read
end-volume

volume nfs-server
    type nfs/server
    option nfs.dynamic-volumes on
    option rpc-auth.addr.web-pub.allow *
    option nfs3.web-pub.volume-id ac556d2e-e8a9-4857-bd17-cab603820fcb
    subvolumes web-pub
end-volume

Any ideas or help would be greatly appreciated.

sean

-- 
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203
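
P.S. In case it helps anyone reading a log like the excerpt above, here is a minimal sketch that tallies which files are reporting gfid mismatches and which gfids are involved. It is not part of our setup. The function name summarize_gfid_mismatches is hypothetical, and the regular expression assumes only the "gfid differs on subvolume" message format shown in the excerpt. Paths containing whitespace are not handled.

```python
import re
from collections import defaultdict

# Matches the AFR warning format seen in the log excerpt:
#   <path>: gfid differs on subvolume N (<gfid-a>, <gfid-b>)
# Assumption: paths contain no whitespace.
GFID_RE = re.compile(
    r"(?P<path>/\S+): gfid differs on subvolume (?P<sub>\d+) "
    r"\((?P<a>[0-9a-f-]{36}), (?P<b>[0-9a-f-]{36})\)"
)


def summarize_gfid_mismatches(lines):
    """Return {path: {"count": int, "gfids": set of gfid strings}}.

    `lines` is any iterable of log lines, e.g. an open file object.
    """
    summary = defaultdict(lambda: {"count": 0, "gfids": set()})
    for line in lines:
        # finditer also copes with several log entries run together on one line
        for match in GFID_RE.finditer(line):
            entry = summary[match.group("path")]
            entry["count"] += 1
            entry["gfids"].update((match.group("a"), match.group("b")))
    return dict(summary)
```

To use it, open the client log file, pass the file object to summarize_gfid_mismatches, and print each path with its count and gfid set. A file that shows two distinct gfids has copies on different bricks that were created independently, which matches the "split brain found" messages in the log.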