I am running glusterfs on a bunch of Linux (openSUSE) machines. I installed from source and have it running fine. However, when I try to add a mount to fstab, the system complains about glusterfs being an unknown file system. I probably missed a step in the installation or some other detail. Any pointers?

Thanks,
Sean
Hi Sean,

Do you have mount.glusterfs properly installed in the "/sbin" directory? Also, please state the version of glusterfs you are using.

Regards

On Fri, Dec 19, 2008 at 9:27 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> I am running glusterfs on a bunch of Linux (openSUSE) machines. I installed from source and have it
> running fine. However, when I try to add a mount to fstab, the system complains about glusterfs being
> an unknown file system. I probably missed a step in the installation or some other detail. Any pointers?
>
> Thanks,
> Sean

--
Harshavardhana
[y4m4 on #gluster at irc.freenode.net]
"Samudaya TantraShilpi"
Z Research Inc - http://www.zresearch.com
On Fri, Dec 19, 2008 at 6:56 PM, Harshavardhana Ranganath <harsha at zresearch.com> wrote:
> Hi Sean,
>
> Do you have mount.glusterfs properly installed in the "/sbin" directory? Also, please state the
> version of glusterfs you are using.

Both you and Keith came to the same correct conclusion: I do not have mount.glusterfs in /sbin. I am using 1.3.13, mainly for testing, so I will probably unmount things, uninstall, and then reinstall 1.4rcX. It sounds like there are a number of new features that would be worth testing.

Thanks to you and Keith for the help.

Sean
Most definitely use 1.4 unless there's a particular need to use 1.3 (and I can't think of any). It's more efficient in nearly all respects from what I can tell, especially with AFR. It has many more features and seems to be quite a bit more stable (I am using RC4 in my environment currently).

Also, I'd look at the configure options and disable the features you're sure you won't need (--disable-bdb, for example). If you're not sure what you don't need, just go for the whole thing :)

Keith

At 06:15 PM 12/19/2008, Sean Davis wrote:
>Both you and Keith came to the same correct conclusion: I do not have
>mount.glusterfs in /sbin. I am using 1.3.13, mainly for testing, so I
>will probably unmount things, uninstall, and then reinstall 1.4rcX.
>It sounds like there are a number of new features that would be worth testing.
>
>Thanks to you and Keith for the help.
>
>Sean
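For reference, a source build along the lines Keith describes might look like the sketch below. The tarball name is illustrative, the --prefix value is the one Sean mentions later in the thread, and --disable-bdb is the example flag Keith gives; adjust all of them to your setup.

# unpack and build glusterfs 1.4.0rc6 from source, skipping the Berkeley DB backend
tar xzf glusterfs-1.4.0rc6.tar.gz
cd glusterfs-1.4.0rc6
./configure --prefix=/usr/local --disable-bdb
make
make install   # run as root
# check whether the mount helper ended up somewhere mount(8) can find it
ls -l /usr/local/sbin/mount.glusterfs /sbin/mount.glusterfs 2>/dev/null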
On Sat, Dec 20, 2008 at 6:09 AM, Keith Freedman <freedman at freeformit.com> wrote:
> Most definitely use 1.4 unless there's a particular need to use 1.3 (and I can't think of any).
> It's more efficient in nearly all respects from what I can tell, especially with AFR. It has many
> more features and seems to be quite a bit more stable (I am using RC4 in my environment currently).
>
> Also, I'd look at the configure options and disable the features you're sure you won't need
> (--disable-bdb, for example). If you're not sure what you don't need, just go for the whole thing :)

I have installed 1.4rc6. No compile errors, etc. However, I still do not have mount.glusterfs in /sbin. I used a custom install location (--prefix=/usr/local) as it is a shared file system for the cluster. Can I simply copy the mount.glusterfs file to the various /sbin locations for each machine and expect things to work?

Sean
At 11:30 AM 12/20/2008, Sean Davis wrote:
>I have installed 1.4rc6. No compile errors, etc. However, I still do
>not have mount.glusterfs in /sbin. I used a custom install location
>(--prefix=/usr/local) as it is a shared file system for the cluster.
>Can I simply copy the mount.glusterfs file to the various /sbin
>locations for each machine and expect things to work?

Yeah, it's probably in /usr/local/sbin. I'd either copy it or put a hard link there; test it on one machine and see if that solves the problem.
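As a concrete sketch of Keith's suggestion (the /usr/local/sbin location is his guess, so check the actual path first):

# confirm where the helper was installed
ls -l /usr/local/sbin/mount.glusterfs
# then either copy it to where mount(8) looks for helpers...
cp /usr/local/sbin/mount.glusterfs /sbin/mount.glusterfs
# ...or hard-link it so both names point at the same file
ln /usr/local/sbin/mount.glusterfs /sbin/mount.glusterfs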
On Sun, Dec 21, 2008 at 3:41 AM, Keith Freedman <freedman at freeformit.com> wrote:
> Yeah, it's probably in /usr/local/sbin. I'd either copy it or put a hard link there; test it on
> one machine and see if that solves the problem.

That's the weird thing: it isn't there. The only place I find it is in the original src directory. It did get made, but it appears it never got copied to ANY destination. I'll just link out from there; not a problem. I'm no expert on makefiles, but I guess I can look through there to see if there is anything funny.

Thanks again,
Sean
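Once mount.glusterfs is reachable from /sbin (a link from the source tree, as Sean describes, should also work), the fstab entry that originally failed should be accepted. A sketch, where the volfile path and /home mount point are the ones that appear later in this thread and the source-tree path is a placeholder:

# create the link on each machine; the exact location inside the build tree depends on your source layout
ln -s <path-to-built-source-tree>/mount.glusterfs /sbin/mount.glusterfs
# example /etc/fstab line for a glusterfs client mount
/etc/glusterfs/glusterfs-home.vol  /home  glusterfs  defaults  0  0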
Keith Freedman
2008-Dec-23 12:43 UTC
[Gluster-users] 1.4.0RC6 AFR problems (backtrace info attached)
here's the backtrace info from 2 of my crashes; a logfile excerpt is at the end.

(gdb) bt
#0  0x0000000000e6dbf2 in afr_truncate_wind (frame=0x7fada8ad9330, this=0xe6e770) at afr-inode-write.c:1145
#1  0x0000000000e72c7d in afr_write_pending_pre_op_cbk (frame=0x7fada8ad9330, cookie=0x8, this=0x185e740, op_ret=<value optimized out>, op_errno=<value optimized out>, xattr=<value optimized out>) at afr-transaction.c:431
#2  0x00000000001212e0 in default_xattrop_cbk (frame=<value optimized out>, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=-1465023696, dict=0xe72b30) at defaults.c:1015
#3  0x000000000060edb0 in posix_xattrop (frame=0x7fada8ad8f10, this=0x1857920, loc=0x7fada8ad96d0, optype=GF_XATTROP_ADD_ARRAY, xattr=0x7fada8ada440) at posix.c:2474
#4  0x0000000000122090 in default_xattrop (frame=0x7fada8ad79a0, this=0x185c9d0, loc=0x7fada8ad96d0, flags=GF_XATTROP_ADD_ARRAY, dict=0x7fada8ada440) at defaults.c:1026
#5  0x0000000000e7374b in afr_write_pending_pre_op (frame=0x7fada8ad9330, this=0x185e740) at afr-transaction.c:494
#6  0x0000000000e73985 in afr_lock_rec (frame=0x7fada8ad9330, this=0x185e740, child_index=2) at afr-transaction.c:690
#7  0x0000000000e74044 in afr_lock_cbk (frame=0x7fada8ad9330, cookie=<value optimized out>, this=0x185e740, op_ret=<value optimized out>, op_errno=0) at afr-transaction.c:617
#8  0x000000000081ed2c in pl_inodelk (frame=0x7fada8ad88d0, this=0x185c9d0, loc=<value optimized out>, cmd=7, flock=0x412e6e70) at internal.c:157
#9  0x0000000000e73d4b in afr_lock_rec (frame=0x7fada8ad9330, this=<value optimized out>, child_index=0) at afr-transaction.c:709
#10 0x0000000000e73f40 in afr_transaction (frame=0x7fada8ad9330, this=0x185e740, type=AFR_DATA_TRANSACTION) at afr-transaction.c:856
#11 0x0000000000e6f062 in afr_truncate (frame=0x7fada8ada480, this=0x185e740, loc=<value optimized out>, offset=0) at afr-inode-write.c:1229
#12 0x00000000018d6bc0 in fuse_setattr (req=<value optimized out>, ino=<value optimized out>, attr=0x412e7000, valid=<value optimized out>, fi=<value optimized out>) at fuse-bridge.c:810
#13 0x0000000001099173 in do_setattr (req=0x7fada8ad91f0, nodeid=214525249896, inarg=<value optimized out>) at fuse_lowlevel.c:486
#14 0x00000000018d7d35 in fuse_thread_proc (data=0x185f070) at fuse-bridge.c:2506
#15 0x00000031f360729a in start_thread () from /lib64/libpthread.so.0
#16 0x00000031f2ae439d in clone () from /lib64/libc.so.6

(gdb) bt
#0  0x0000000000e6dbf2 in afr_truncate_wind (frame=0x1917520, this=0xe6e770) at afr-inode-write.c:1145
#1  0x0000000000e72c7d in afr_write_pending_pre_op_cbk (frame=0x1917520, cookie=0x8, this=0x1718740, op_ret=<value optimized out>, op_errno=<value optimized out>, xattr=<value optimized out>) at afr-transaction.c:431
#2  0x00000000001212e0 in default_xattrop_cbk (frame=<value optimized out>, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=26304544, dict=0x0) at defaults.c:1015
#3  0x000000000060edb0 in posix_xattrop (frame=0x19176e0, this=0x1711920, loc=0x1918520, optype=GF_XATTROP_ADD_ARRAY, xattr=0x1917610) at posix.c:2474
#4  0x0000000000122090 in default_xattrop (frame=0x1915f60, this=0x17169d0, loc=0x1918520, flags=GF_XATTROP_ADD_ARRAY, dict=0x1917610) at defaults.c:1026
#5  0x0000000000e7374b in afr_write_pending_pre_op (frame=0x1917520, this=0x1718740) at afr-transaction.c:494
#6  0x0000000000e73985 in afr_lock_rec (frame=0x1917520, this=0x1718740, child_index=2) at afr-transaction.c:690
#7  0x0000000000e74044 in afr_lock_cbk (frame=0x1917520, cookie=<value optimized out>, this=0x1718740, op_ret=<value optimized out>, op_errno=0) at afr-transaction.c:617
#8  0x000000000081ed2c in pl_inodelk (frame=0x1915b00, this=0x17169d0, loc=<value optimized out>, cmd=7, flock=0x42f4be70) at internal.c:157
#9  0x0000000000e73d4b in afr_lock_rec (frame=0x1917520, this=<value optimized out>, child_index=0) at afr-transaction.c:709
#10 0x0000000000e73f40 in afr_transaction (frame=0x1917520, this=0x1718740, type=AFR_DATA_TRANSACTION) at afr-transaction.c:856
#11 0x0000000000e6f062 in afr_truncate (frame=0x19183c0, this=0x1718740, loc=<value optimized out>, offset=0) at afr-inode-write.c:1229
#12 0x0000000005f00bc0 in fuse_setattr (req=<value optimized out>, ino=<value optimized out>, attr=0x42f4c000, valid=<value optimized out>, fi=<value optimized out>) at fuse-bridge.c:810
#13 0x00000000039d7173 in do_setattr (req=0x19179f0, nodeid=214525249896, inarg=<value optimized out>) at fuse_lowlevel.c:486
#14 0x0000000005f01d35 in fuse_thread_proc (data=0x1719070) at fuse-bridge.c:2506
#15 0x00000031f360729a in start_thread () from /lib64/libpthread.so.0
#16 0x00000031f2ae439d in clone () from /lib64/libc.so.6

logfile excerpt:

60: #end-volume
+-----
2008-12-23 00:28:38 E [socket.c:708:socket_connect_finish] home2: connection failed (Connection timed out)
pending frames:

Signal received: 11
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
tv_nsec 1
package-string: glusterfs 1.4.0rc6
/lib64/libc.so.6[0x31f2a322a0]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/cluster/afr.so(afr_truncate_wind+0x72)[0xe6dbf2]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/cluster/afr.so(afr_write_pending_pre_op_cbk+0xcd)[0xe72c7d]
/usr/local/lib/libglusterfs.so.0(default_xattrop_cbk+0x20)[0x1212e0]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/storage/posix.so(posix_xattrop+0x1e0)[0x60edb0]
/usr/local/lib/libglusterfs.so.0(default_xattrop+0xc0)[0x122090]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/cluster/afr.so(afr_write_pending_pre_op+0x4fb)[0xe7374b]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/cluster/afr.so[0xe73985]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/cluster/afr.so(afr_lock_cbk+0xa4)[0xe74044]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/features/posix-locks.so(pl_inodelk+0x11c)[0x81ed2c]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/cluster/afr.so[0xe73d4b]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/cluster/afr.so(afr_transaction+0x110)[0xe73f40]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/cluster/afr.so(afr_truncate+0x1f2)[0xe6f062]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/mount/fuse.so[0x6753bc0]
/usr/local/lib/libfuse.so.2[0x1099173]
/usr/local/lib/glusterfs/1.4.0rc6/xlator/mount/fuse.so[0x6754d35]
/lib64/libpthread.so.0[0x31f360729a]
/lib64/libc.so.6(clone+0x6d)[0x31f2ae439d]
---------
Version      : glusterfs 1.4.0rc6 built on Dec 23 2008 00:22:39
TLA Revision : glusterfs--mainline--3.0--patch-792
Starting Time: 2008-12-23 00:41:14
Command line : /usr/local/sbin/glusterfs --log-level=WARNING --volfile=/etc/glusterfs/glusterfs-home.vol /home
given volfile
+-----
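For anyone who needs to produce a report like this, the backtrace above is standard gdb output rather than anything gluster-specific: point gdb at the glusterfs binary and the core file it dumped, then run bt. The core file name below is illustrative.

# load the crashed binary and its core dump; the core path depends on your system's core pattern
gdb /usr/local/sbin/glusterfs /path/to/core
# at the (gdb) prompt, print the stack of the crashing thread:
# (gdb) bt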
Amar (ಅಮರ್ ತುಂಬಳ್ಳಿ)
2008-Dec-23 19:47 UTC
[Gluster-users] 1.4.0RC6 AFR problems (backtrace info attached)
Hi Keith,

Thanks for these logs. Very helpful.

Regards,

2008/12/23 Keith Freedman <freedman at freeformit.com>
> here's the backtrace info from 2 of my crashes; a logfile excerpt is at the end

--
Amar Tumballi
Gluster/GlusterFS Hacker
[bulde on #gluster/irc.gnu.org]
http://www.zresearch.com - Commoditizing Super Storage!
Anand Avati
2008-Dec-23 19:58 UTC
[Gluster-users] 1.4.0RC6 AFR problems (backtrace info attached)
Keith,

Thanks for the reports. Fixed this bug; it will be available in the next release or off the tla.

Thanks!
avati

2008/12/23 Keith Freedman <freedman at freeformit.com>:
> here's the backtrace info from 2 of my crashes; a logfile excerpt is at the end
> (gdb) bt
Hi Keith.

Sorry for the previous email, it was a bit out of place.

Would you mind sharing how you recovered from this issue? I'm going to stress test a solution based on GlusterFS next week, including pulling a live disk offline in the middle of work, and would appreciate any hints you might share regarding recovering from the failures.

Regards.

2008/12/23 Keith Freedman <freedman at freeformit.com>

> So, I had a drive failure on one of my boxes and it led to the discovery of numerous issues today:
>
> 1) When a drive is failing and one of the AFR servers is dealing with IO errors, the other one freaks
> out and sometimes crashes, but doesn't seem to ever network timeout.
>
> 2) When starting gluster on the server with the new empty drive, it gave me a bunch of errors about
> things being out of sync and to delete a file from all but the preferred server. This struck me as odd,
> since the thing was empty, so I used favorite-child, but this isn't a preferred solution long term.
>
> 3) One of the directories had 20GB of data in it. I went to do an ls of the directory and had to wait
> while it auto-healed all the files. While this is helpful, it would be nice to have gotten back the
> directory listing without having to wait for 20GB of data to get sent over the network.
>
> 4) While the other server was down, the up server kept failing (signal 11?) and I had to constantly
> remount the filesystem. It was giving me messages about the other node being down, which was fine,
> but then it'd just die after a while, consistently.
Replies inline.

> 1) When a drive is failing and one of the AFR servers is dealing with IO errors, the other one freaks
> out and sometimes crashes, but doesn't seem to ever network timeout.

This was the same issue as (4).

> 2) When starting gluster on the server with the new empty drive, it gave me a bunch of errors about
> things being out of sync and to delete a file from all but the preferred server. This struck me as odd,
> since the thing was empty, so I used favorite-child, but this isn't a preferred solution long term.

Sure, this should not happen. Not yet fixed; I will be looking at it today.

> 3) One of the directories had 20GB of data in it. I went to do an ls of the directory and had to wait
> while it auto-healed all the files. While this is helpful, it would be nice to have gotten back the
> directory listing without having to wait for 20GB of data to get sent over the network.

Currently this behavior is not going to be changed (at least until 1.4.0), because this can happen only when it is self-healing, and it makes sure things are OK when a file is accessed for the first time. As it works fine now, we don't want to make a code change this close to a stable release.

> 4) While the other server was down, the up server kept failing (signal 11?) and I had to constantly
> remount the filesystem. It was giving me messages about the other node being down, which was fine,
> but then it'd just die after a while, consistently.

This is fixed in tla. We have made a QA release to the internal team; once it passes basic tests, we will be making the next 'RC' release.

Regards,
Amar
At 03:05 PM 12/24/2008, Amar Tumballi (bulde) wrote:
>>3) One of the directories had 20GB of data in it. I went to do an ls of the directory and had to wait
>>while it auto-healed all the files. While this is helpful, it would be nice to have gotten back the
>>directory listing without having to wait for 20GB of data to get sent over the network.
>
>Currently this behavior is not going to be changed (at least until 1.4.0), because this can happen only
>when it is self-healing, and it makes sure things are OK when a file is accessed for the first time.
>As it works fine now, we don't want to make a code change this close to a stable release.

I understand the purpose of the functionality and would normally be fine with it, but it's just an inconvenient approach. Ideally (in 1.4.1, perhaps), it would return the directory listing to the requester and then do the actual data transfer in the background, since asking for a directory listing doesn't imply that one actually cares about the individual file data at this point in time.

Also, if this is the case, then if one of the entries in the directory is itself a directory, does that whole directory get auto-healed at the same time, or just the files within the current directory? In other words, will this cause an auto-heal of an entire directory tree, which would be terribly inconvenient if one has to wait all that time?

>>4) While the other server was down, the up server kept failing (signal 11?) and I had to constantly
>>remount the filesystem. It was giving me messages about the other node being down, which was fine,
>>but then it'd just die after a while, consistently.
>
>This is fixed in tla. We have made a QA release to the internal team; once it passes basic tests, we
>will be making the next 'RC' release.

I'll do some testing once the next RC is available for download.
At 02:45 PM 12/24/2008, Stas Oskin wrote:
>Hi Keith.
>
>Sorry for the previous email, it was a bit out of place.
>
>Would you mind sharing how you recovered from this issue?

I think Amar's responses to my email will be helpful. Especially given that some of my issues were bugs that are fixed or being fixed, my particular method of recovery wouldn't necessarily apply.

>I'm going to stress test a solution based on GlusterFS next week, including pulling a live disk offline
>in the middle of work, and would appreciate any hints you might share regarding recovering from the failures.

I think with the fixed bugs, it should be as easy as I expected. Once you have an empty underlying filesystem (with no gluster extended attributes), AFR should auto-heal the entire directory without a problem. It tried to do this but hit a bug, which was overcome by setting the option favorite-child in the AFR translator. This isn't necessarily an ideal production run-time configuration, but it's reasonable to set it to recover from a drive failure and then unset it after the recovery is complete.

As for the specifics of forcing auto-heal, I used the find command from the wiki:

find /GLUSTERMOUNTPOINT -type f -exec head -1 {} \; > /dev/null

It can be interesting to tail -f the gluster logfile in another window while this goes on.

I've found the script "whodir", posted a while ago, to be helpful when I'm having trouble re-mounting the filesystem after gluster crashes. It lists the processes whose current directory or executable lives under a given directory:

--whodir--
#!/bin/sh
# Print the pid and command line of every process whose cwd or exe symlink
# in /proc points somewhere under the directory given as $1.
DIR=$1
find /proc 2>/dev/null | grep -E 'cwd|exe' | xargs ls -l 2>/dev/null | \
  grep "> $DIR" | sed 's/  */ /g' | cut -f8 -d' ' | cut -f3 -d/ | \
  sort | uniq | while read line; do echo $line $(cat /proc/$line/cmdline); done

>Regards.
>
>2008/12/23 Keith Freedman <freedman at freeformit.com>
>
>> So, I had a drive failure on one of my boxes and it led to the discovery of numerous issues today:
>> [...]
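A quick usage sketch for the script above, assuming it is saved as "whodir" and the gluster mount is /home as elsewhere in this thread:

chmod +x whodir
./whodir /home
# prints roughly "<pid> <command line>" for each process keeping the mount busy;
# stop those processes before retrying umount and remounting the gluster volume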