On Tue, Nov 24, 2015 at 1:23 AM, Audrius Butkevicius <audrius.butkevicius at gmail.com> wrote:
> Hi,
>
> I've got a geo-replicated gluster volume, with a few hundred thousand
> images, which get generated on demand.
>
> I started getting replication failures in the status detail view, but it's
> not obvious to me where to find the actual errors or how to actually fix
> them.

Chris here[1] mentioned a bug in rsync (thanks!). Could that be the issue here?

Mind checking the rsync version used?

[1]: http://www.gluster.org/pipermail/gluster-users/2015-November/024423.html

> The docs seem to be secretive about this as well. It seems if I tear the
> geo-replication down, and do a force create from scratch, it goes back in
> sync again, but as the files get generated, it starts getting failures again
> at some point.
>
> Can someone provide me with information on how to check which files are
> causing failures, and what are the actual failures? Or point me to the
> relevant part in the docs?
>
> Version 3.7.5-ubuntu1~trusty1
>
> Related SO question:
> http://stackoverflow.com/questions/33839056/gluster-geo-replication-debugging-failures
>
> Thanks,
>
> Audrius.
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
Audrius Butkevicius
2015-Nov-24 21:38 UTC
[Gluster-users] Debugging georeplication failures
So the version of rsync is 3.1.0, but the bug mentioned only applies to large files, whereas in my case the files are under 1 MB.

I've started digging through the logs and found a bunch of these on the slave:

[2015-11-20 11:40:46.730805] W [fuse-bridge.c:1978:fuse_create_cbk] 0-glusterfs-fuse: 1882288: /.gfid/31d66429-c700-4a10-bb32-35e1b36a479f => -1 (Operation not permitted)
[2015-11-20 12:39:59.269844] W [fuse-bridge.c:1978:fuse_create_cbk] 0-glusterfs-fuse: 1918306: /.gfid/6802a0c6-1f62-4213-a70d-7b46d9ff8f3a => -1 (Operation not permitted)

So something funky was happening for an hour 4 days ago. Given the volume is on EBS, maybe there was some glitch there.

I can also find the corresponding failures on the master:

[2015-11-20 11:40:14.93090] W [master(/data/media):803:log_failures] _GMaster: ENTRY FAILED: ({'uid': 33, 'gfid': '31d66429-c700-4a10-bb32-35e1b36a479f', 'gid': 33, 'mode': 33206, 'entry': '.gfid/b1dc6c6d-dac7-4da9-9577-4614942a72a0/official-nightmare-before-christmas-vampire-teddy-girls-dress-body-web.jpg', 'op': 'CREATE'}, 17, 'df0e67f5-f2ce-45c3-b4f1-224aa3059ec7')
[2015-11-20 11:40:14.265054] W [master(/data/media):803:log_failures] _GMaster: META FAILED: ({'go': '.gfid/31d66429-c700-4a10-bb32-35e1b36a479f', 'stat': {'atime': 1448019600.232466, 'gid': 33, 'mtime': 1448019600.316466, 'mode': 33279, 'uid': 33}, 'op': 'META'}, 2)

If I grep for SKIPPED GFID I get the following:

[2015-11-20 11:40:40.704817] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID
192632af-28c5-4e03-a62d-458fe7f3b5f9,7ea8d7a8-524b-4dd0-b97a-dc7d3481f341,204f6112-0e8d-4f6d-855b-bf10f9c63b62,7e626e8f-edad-4f39-a6c6-547a1da34aa1,1f0d0208-1962-4eb1-91d4-cf7ed297d8e3,95d389c4-3258-4ca0-8fc4-26b8427b1eaf,425cedc6-6343-4326-8540-996d2d56dc9c,5955928b-2b8f-4cc9-a336-3eac4382789b,8932efcd-ba90-46ec-84c8-5e9e51cc84e9,2530275d-5f03-4143-9abf-d07cc79bf80a,73574466-86f3-4ab2-b5da-c31ac28c27c1,776e5e8f-5c6a-46b1-ad54-733e157d2097,008a69f3-217c-4dbc-a469-5a5bc8ecd589,dca8d8d9-03cf-4793-92e4-bfcfddd262f6,c85b7a29-73af-4f44-a07e-a44082d7a93a,6c1f56d6-4ea6-4910-9677-ea33edd35d28,0ea56588-87fa-4355-9403-e311525454fc,c8ce76c9-e21d-46ce-a2b5-14dfd0070f64,db9e6484-0e5e-4f6e-815b-3c2b273deee5,35d10752-43b5-4398-be5f-17cb9de73a6b,396e5faf-74a1-4849-97e3-009dbfb22836,d148e7d5-c2f3-4d06-8cd6-8588e6aac196,404d20c5-1c6c-4aad-98be-2c23930173b3,f1fae11c-db8e-4cd5-8e47-a3870316f89c,d8daa413-e57f-44fb-b907-b1a497f2dcfa,5f6ee8c2-84fb-432e-95cd-e428ab256e83,6bf54dcd-c3b4-4187-a390-eca841e46570,335c07ca-d339-4d3a-aa88-3b5753d24fbf,8fdbac00-6628-4f22-8fb4-b7a6524cae49,31d66429-c700-4a10-bb32-35e1b36a479f
[2015-11-20 11:41:35.907850] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID 03069c7f-8eaa-45b0-92ed-50cb648cd912,788f5ed1-923e-4b86-9696-2a6de07ebb2e,43d12b40-b6e2-43c4-8883-85e89dc81321
[2015-11-20 12:11:55.492068] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID eb02369f-7ca8-480a-b00c-768964410ed8,17045ac9-27dd-4bf9-9f90-d7b146070dd5,265e3d9c-1657-45cb-bbf6-db439eb18ccf,553c420f-b3cc-47f2-8d5f-cfc2ffdd1a92
[2015-11-20 12:12:53.372432] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID 66c5878e-8c00-4f7d-a3ad-4adec84a5e22,f4dc086d-9c2b-449c-9e31-bbae9ebcdea7,f99317b2-72e8-49e3-b676-647abad508b1
[2015-11-20 12:37:55.773813] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID 4af54f1c-e8e1-4915-9328-a458d5d35d5d,acbe1f12-87e8-4192-b864-d90030269bba,7d27a795-da63-4742-9e91-abd8fa543612,8d4e642d-fd40-44d6-8419-8d3459df7ce3
[2015-11-20 12:39:28.852575] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID d90dc121-02e7-4a79-bc03-1bd8fddd9f48,54bb563f-ab44-4e91-a46b-764a122ce7fa,088141de-7545-40f9-b776-751738a89740,2dab3faf-4a6c-407a-88cd-cddef6f55299,d887806f-23b4-4389-a4dc-f9027702a2df,fc5a9bc8-ea62-4677-baed-16510541373a,33136ad2-c5b4-448c-991d-1e72fefef021,cf3e2675-e41b-4782-9478-91773eb0a4aa,6412d878-e0f1-4700-84df-05f4af35962f,ec3cf6e1-7f27-4650-b978-8a5a7f620389,d3651bb9-cd2d-4c5f-93e6-fe4fb1cdf5db,ecb0415e-1524-40f4-870e-1fd0f8371b1d,a118aaae-bd3e-4b19-a0e0-891aa9edb09a,7642d3f3-f1e5-4aca-bcfe-bdb3c44779a9,2e29f3f8-c460-48eb-9db5-b281b67cc2bf,e61db54b-3979-488a-8789-a5d0615c5a97,4212d840-9c22-4d9e-b61b-5e35271dfe80,dad1c60b-9da6-4e57-b014-daa1aca73ce3,93699a3d-40b8-4bbd-b78f-aabf965df57f,4fad7468-91f2-4deb-aaf7-6401068c9e6d,c9738295-46cc-4fe7-b359-dc94f5815ce9,91853c5c-4877-4c9e-9481-c86368942f78,59deed8e-d3d0-4ab7-854e-53a8dd455de0,20b86c13-7df1-4d13-bac1-7d628a00d6ce,b7b86a2d-7963-41a4-a423-14e25d1e78c4,3c17d7fe-bb7f-489c-a525-5c8b7bb93c3e,e230d207-7c68-4983-a958-f2dcfc1ce694,fa8bf3c0-abae-446c-83c5-45ef8bcaa4b8,14089102-8106-45d9-a3f1-d1446b568f4e,6802a0c6-1f62-4213-a70d-7b46d9ff8f3a,0a253bbc-ef98-4da0-951f-e17c5a7f5858,ef054b76-986b-4a89-b8e6-b4988221aaa2,48c0a153-708c-44ee-b186-cf255936a02b,fa2646a6-807c-4e9d-8f2b-a9cdf2674e0c,1ed4a563-4f6a-4b5a-9866-89025fe7afd5,0f293cf7-bc32-4f8a-87d5-388a4bffb4af,f4126726-667b-451d-8214-a18bb3f468cd,e23dc8b3-da1c-4d18-aec9-22e0aa174d81,40b9f10d-7304-4c0b-8498-bef23b305d03,15c25d1e-2a62-495e-887f-14d0cb0527b1,67371804-9084-4801-b664-44e88bea8ac3,4750fa3f-d1a4-4472-b10d-3f75d0b451dc
[2015-11-23 09:18:10.43391] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID
228843f3-62f0-4687-b5eb-6d1e21257ad0,b0078359-fbf0-4709-8f40-8383a11d7875,60cff4d5-8b5d-4f7f-8bc1-27081a011458,bedb6ac4-208d-47e1-812c-5547c84ab841,da6810d9-4883-45e1-b73e-55a7ff17b5e7,e03b5c03-b25c-49ba-86f0-8a709a9c2658,053673a0-c1cc-4057-83fa-f97740cb5d4f,dbd6ea84-8f24-4a47-ac41-22c3fd788ecf,43caa3e7-ca04-47ab-b950-105606b313a4,62d8b1d0-fc89-4fb1-a41a-957dcb34d325,4e8fe1fa-60cd-47fa-bad6-f617c312f53b,6c3d6cf3-62ae-4ab8-9dc3-7815552401fe,f79be814-7e78-4985-bcdd-688da23d1808,c4186455-0f06-4b5d-89be-3c5ccbdeb6f0,f9c4ccdb-2337-479d-845d-ee4d85b69ece,bcd14726-1bab-4d97-8915-ec8bbe8faf8c,cca82341-a430-4a59-a900-1af66dcf7bb8,b7043a8e-4286-4831-91ec-c146e40bc6be,995ffeb6-a906-4078-88c6-404a2b38aad4,227f9987-5057-4133-848a-2b22aca5dde1,90b35242-32db-4570-8070-cf9dd49322a5,c6863c8f-1914-4a2d-814b-6e5853134faf,e2d19b1a-fc07-441c-b110-ca816b46fc40,9a3d0c0b-7d84-416f-9f3e-21b32a11ba1d,d8163f6b-8c40-418c-9c06-b3743af24e4e,522d7247-a75b-4af9-acb2-52a99eeced89,4b56ea9d-413a-4e24-b44e-433f7603ad6d

There are also the following lines on the master, which might have some impact:

E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-media-replicate-0: Failing READ on gfid abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed. [Input/output error]
E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-media-replicate-0: Failing GETXATTR on gfid abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed.
[Input/output error] E [mem-pool.c:417:mem_get0] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x809a2) [0x7f79e436b9a2] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f) [0x7f79e430cb1f] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81) [0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid argument] E [mem-pool.c:417:mem_get0] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f79e4329b32] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f) [0x7f79e430cb1f] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81) [0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid argument] E [resource(/data/media):222:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-dpY5cI/8216bb7da58a00926f369bb7ac8c7e03.sock root at us-west-gluster.server.com /usr/lib/x86_64-linux-gnu/glusterfs/gsyncd --session-owner 6922055e-49a1-4afd-a3a0-a47960d6ba54 -N --listen --timeout 120 gluster://localhost:media" returned with 143, saying: E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.772896] I [cli.c:721:main] 0-cli: Started running /usr/sbin/gluster with version 3.7.5 E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.772955] I [cli.c:608:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.871930] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.872018] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.872898] I [cli-rpc-ops.c:6348:gf_cli_getwd_cbk] 0-cli: Received resp to getwd E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.872963] I 
[input.c:36:cli_batch] 0-: Exiting with: 0 Status detail shows the following: root at eu-gluster-1:/var/log/glusterfs/geo-replication/media# gluster volume geo-replication media root at us-west-gluster.websitewebsitewebs.com::media status detail MASTER NODE MASTER VOL MASTER BRICK SLAVE USER SLAVE SLAVE NODE STATUS CRAWL STATUS LAST_SYNCED ENTRY DATA META FAILURES CHECKPOINT TIME CHECKPOINT COMPLETED CHECKPOINT COMPLETION TIME ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- eu-gluster-1.websitewebsitewebs.com media /data/media root us-west-gluster.websitewebsitewebs.com::media us-west-gluster.websitewebsitewebs.com Active Changelog Crawl 2015-11-24 20:59:25 0 0 0 633 N/A N/A N/A eu-gluster-2.websitewebsitewebs.com media /data/media root us-west-gluster.websitewebsitewebs.com::media us-west-gluster.websitewebsitewebs.com Passive N/A N/A N/A N/A N/A N/A N/A N/A N/A What is the right way to retry failed items? Can I get a list of them somehow so that I could touch them in hopes to fix this? I wonder why does it not retry the items automatically? On Tue, Nov 24, 2015 at 6:11 AM, Venky Shankar <vshankar at redhat.com> wrote:> On Tue, Nov 24, 2015 at 1:23 AM, Audrius Butkevicius > <audrius.butkevicius at gmail.com> wrote: > > Hi, > > > > I've got a geo-replicated gluster volume, with a few hundred thousand > > images, which get generated on demand. > > > > I started getting replication failures in the status detail view, but > it's > > not obvious to me where to find the actual errors or how to actually fix > > them. > > Chris here[1] mentioned about a bug in rsync (thanks!). Could that be > the issue here? > > Mind checking rsync version used? 
> > [1]: > http://www.gluster.org/pipermail/gluster-users/2015-November/024423.html > > > > > The docs seem to be secretive about this as well. It seems if I tear the > > geo-replication down, and do a force create from scratch, it goes back in > > sync again, but as the files get generated, it starts getting failures > again > > at some point. > > > > Can someone provide me with information on how to check which files are > > causing failures, and what are the actual failures? Or point me to the > > relevant part in the docs? > > > > Version 3.7.5-ubuntu1~trusty1 > > > > Related SO question: > > > http://stackoverflow.com/questions/33839056/gluster-geo-replication-debugging-failures > > > > Thanks, > > > > Audrius. > > > > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-users >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20151124/e4612032/attachment.html>
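
On getting a list of the affected items: a minimal sketch, assuming the only inputs are master log lines shaped like the ENTRY FAILED, META FAILED and SKIPPED GFID entries quoted above. The regular expressions are derived from those samples alone, and the names `failed_gfids` and `backend_path` are made up for this illustration; neither is part of Gluster. The `.glusterfs/<first two hex chars>/<next two>/<gfid>` layout is how a brick stores its GFID links, which lets a GFID be mapped back to something that can be inspected on the brick (`/data/media` in this thread). Whether touching the files actually makes geo-replication pick them up again is not something this sketch establishes.

```python
import re

# GFIDs are ordinary lowercase UUID strings in the quoted logs.
GFID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

# "... SKIPPED GFID <gfid>,<gfid>,..." (comma-separated list on one log line)
SKIPPED_RE = re.compile(r"SKIPPED GFID\s+((?:%s)(?:,(?:%s))*)" % (GFID, GFID))

# "... ENTRY FAILED: ({... 'gfid': '<gfid>' ...})"
# "... META FAILED: ({'go': '.gfid/<gfid>' ...})"
FAILED_RE = re.compile(
    r"(ENTRY|META) FAILED: .*?'(?:gfid|go)': '(?:\.gfid/)?(%s)'" % GFID
)


def failed_gfids(lines):
    """Return {gfid: set of reasons} for every GFID named in the log lines.

    Insertion order follows first appearance in the log.
    """
    found = {}
    for line in lines:
        match = SKIPPED_RE.search(line)
        if match:
            for gfid in match.group(1).split(","):
                found.setdefault(gfid, set()).add("SKIPPED")
        match = FAILED_RE.search(line)
        if match:
            found.setdefault(match.group(2), set()).add(match.group(1) + " FAILED")
    return found


def backend_path(brick, gfid):
    """Path of the GFID link under the brick's .glusterfs directory."""
    return "%s/.glusterfs/%s/%s/%s" % (brick.rstrip("/"), gfid[:2], gfid[2:4], gfid)
```

To use it, pass an open log file (any iterable of lines works), for example `failed_gfids(open(path_to_master_log))`, then call `backend_path("/data/media", gfid)` on each key. For regular files the link under `.glusterfs` is a hard link, so `find /data/media -samefile <that path>` run on the brick should reveal the real file name.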