Hi,

I have a Gluster volume that consists of 22 bricks and includes a single folder with 3.6 million files. Yesterday the volume crashed and became completely unresponsive. I was forced to hard-reboot all Gluster servers, because they were so heavily overloaded that they could not even execute a reboot command issued from the shell. Each Gluster server has 12 CPU cores, and every one of them was maxed out.

The server log looks like this: http://pastebin.com/rv0WQ14E
The client log looks like this: http://filebin.ca/YLiU6yxURKd/client-log.txt

The most interesting parts from the server log are:

[2013-02-25 19:48:01.672143] W [socket.c:195:__socket_rwv] 0-tcp.adata-server: writev failed (Connection reset by peer)
[2013-02-25 19:48:01.742239] I [server-helpers.c:474:do_fd_cleanup] 0-adata-server: fd cleanup on /files/random-file.dat
[2013-02-25 19:48:01.749871] E [server.c:176:server_submit_reply] (-->/usr/lib/glusterfs/3.3.1/xlator/features/marker.so(marker_lookup_cbk+0x103) [0x7ffdfce19e53] (-->/usr/lib/glusterfs/3.3.1/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xff) [0x7ffdfcc00f9f] (-->/usr/lib/glusterfs/3.3.1/xlator/protocol/server.so(server_lookup_cbk+0x39d) [0x7ffdfc9e480d]))) 0-: Reply submission failed
[2013-02-25 19:48:01.805132] I [server-helpers.c:330:do_lock_table_cleanup] 0-adata-server: finodelk released on /files/random-file.dat
[2013-02-25 19:59:46.438680] W [socket.c:195:__socket_rwv] 0-tcp.adata-server: readv failed (Connection timed out)
[2013-02-25 20:03:01.825280] I [server-helpers.c:394:do_lock_table_cleanup] 0-adata-server: entrylk released on /files
[2013-02-25 20:03:01.825309] I [server-helpers.c:474:do_fd_cleanup] 0-adata-server: fd cleanup on /files

The most interesting parts from the client log are:

[2013-02-25 19:44:43.568372] W [client3_1-fops.c:5306:client3_1_finodelk] 0-adata-client-7: (b468c831-5c04-4bbd-8b03-b94d7f937e55) remote_fd is -1. EBADFD
[2013-02-25 19:44:43.568405] W [client3_1-fops.c:4098:client3_1_flush] 0-adata-client-7: (b468c831-5c04-4bbd-8b03-b94d7f937e55) remote_fd is -1. EBADFD
[2013-02-25 19:44:55.423638] I [afr-self-heal-common.c:1189:afr_sh_missing_entry_call_impunge_recreate] 0-adata-replicate-8: no missing files - /files/random-file.dat. proceeding to metadata check
[2013-02-25 19:44:59.816525] I [afr-self-heal-common.c:1941:afr_sh_post_nb_entrylk_conflicting_sh_cbk] 0-adata-replicate-1: Non blocking entrylks failed.
[2013-02-25 19:44:59.816554] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-adata-replicate-1: background data missing-entry gfid self-heal failed on /files/random-file.dat
[2013-02-25 19:47:01.548989] I [afr-self-heal-common.c:1941:afr_sh_post_nb_entrylk_conflicting_sh_cbk] 0-adata-replicate-8: Non blocking entrylks failed.
[2013-02-25 19:47:01.548996] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-adata-replicate-8: background data missing-entry gfid self-heal failed on /files/random-file.dat
[2013-02-25 19:56:20.623378] I [dht-layout.c:593:dht_layout_normalize] 0-adata-dht: found anomalies in /files. holes=1 overlaps=0
[2013-02-25 19:56:20.623399] W [dht-layout.c:186:dht_layout_search] 0-adata-dht: no subvolume for hash (value) = 2967408968
[2013-02-25 19:56:20.623404] W [dht-selfheal.c:882:dht_selfheal_directory] 0-adata-dht: 1 subvolumes have unrecoverable errors
[2013-02-25 19:56:20.623413] E [dht-common.c:1372:dht_lookup] 0-adata-dht: Failed to get hashed subvol for /files/random-file.dat
[2013-02-25 19:57:02.780965] W [fuse-bridge.c:292:fuse_entry_cbk] 0-glusterfs-fuse: 3960898: LOOKUP() /files/random-file.dat => -1 (Invalid argument)
[2013-02-25 19:57:02.781385] W [dht-layout.c:186:dht_layout_search] 0-adata-dht: no subvolume for hash (value) = 2967408968
[2013-02-25 20:00:32.715869] I [dht-layout.c:593:dht_layout_normalize] 0-adata-dht: found anomalies in /files. holes=2 overlaps=0
[2013-02-25 20:00:32.715886] W [dht-selfheal.c:882:dht_selfheal_directory] 0-adata-dht: 2 subvolumes have unrecoverable errors
[2013-02-25 20:00:47.566817] I [dht-common.c:543:dht_revalidate_cbk] 0-adata-dht: subvolume adata-replicate-9 for /files returned -1 (Input/output error)

What could be the reason for this? I noticed that the folder containing the 3.6 million files was locked, so I am now restructuring the directory layout into a directory tree with approximately 100-500 files in every directory. Would this prevent a lock that affects all files?
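
For reference, the restructuring could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name sharded_path, the two-level layout, the fan-out of 128, and the use of an MD5 prefix as the bucketing hash are all hypothetical choices, not something taken from Gluster or from the existing setup. With 128 * 128 = 16384 leaf directories, 3.6 million files come out at roughly 220 files per directory on average, which falls inside the 100-500 target.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical fan-out: 128 first-level x 128 second-level directories
# gives 16384 leaf directories in total.
FANOUT = 128


def sharded_path(root: str, filename: str, fanout: int = FANOUT) -> PurePosixPath:
    """Map a file name to root/<level1>/<level2>/<filename>.

    The bucket is derived from a hash of the file name only, so the same
    name always maps to the same directory and no lookup table is needed.
    """
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    # Take the first 4 bytes of the digest as an unsigned integer and
    # reduce it to one of fanout * fanout buckets.
    bucket = int.from_bytes(digest[:4], "big") % (fanout * fanout)
    level1, level2 = divmod(bucket, fanout)
    return PurePosixPath(root) / f"{level1:03d}" / f"{level2:03d}" / filename


if __name__ == "__main__":
    # Example: where would /files/random-file.dat end up?
    print(sharded_path("/files", "random-file.dat"))
    # Average number of files per leaf directory for 3.6 million files.
    print(3_600_000 / (FANOUT * FANOUT))
```

A hash of the name is used rather than, say, the first letters of the name, so that the files spread evenly even if many names share a common prefix.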