Pavel,
I'm running 6.1-STABLE (kernel/world rebuilt as of 8/28 @ 2pm CST) with
these patches applied:
gjournal6_20060808.patch
vfs_subr.c.3.patch
The backend RAID presents 4 LUNs; this is how we configured it:
da1 - 8G
da2 - ~897G
da3 - 8G
da4 - ~897G
da2 and da4 have been partitioned in FreeBSD; then we did the following:
gjournal label -v /dev/da2 /dev/da1
gjournal label -v /dev/da4 /dev/da3
newfs -U -L "scr09" /dev/da2.journal
newfs -U -L "scr10" /dev/da4.journal
So there is one 8G journal for each data device.
Now that the server is under load, I'm seeing "NFS server not responding"
messages on my clients. The messages correspond to the gjournal
suspend/copy operation, and my clients either hang or report "No such
file or directory".
We copied 137G to /scr10 and the copy just finished; could this be
leftover writes still being flushed from the journal?
Here is the time correlation, first from the server (donkey):
Aug 31 13:55:24 donkey kernel: GEOM_JOURNAL[1]: Starting copy of journal.
Aug 31 13:55:24 donkey kernel: GEOM_JOURNAL[1]: Switch time of da4: 0.002798s
Aug 31 13:55:24 donkey kernel: GEOM_JOURNAL[1]: Entire switch time: 14.030198s
Aug 31 13:55:24 donkey kernel: GEOM_JOURNAL[1]: Data has been copied.
Aug 31 13:55:33 donkey kernel: GEOM_JOURNAL[1]: Entire switch time: 0.000013s
Aug 31 13:55:44 donkey kernel: GEOM_JOURNAL[1]: Entire switch time: 0.000013s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Msync time of /scr09: 0.000010s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Sync time of /scr09: 0.000009s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Suspend time of /scr09: 0.000007s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Starting copy of journal.
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Switch time of da2: 0.002302s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Data has been copied.
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Msync time of /scr10: 0.029769s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Sync time of /scr10: 0.035259s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Suspend time of /scr10: 10.109732s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Starting copy of journal.
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Switch time of da4: 0.002756s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Entire switch time: 10.182759s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Data has been copied.
Aug 31 13:56:14 donkey kernel: GEOM_JOURNAL[1]: Entire switch time: 0.000012s
Aug 31 13:56:24 donkey kernel: GEOM_JOURNAL[1]: Entire switch time: 0.000011s
Aug 31 13:56:46 donkey kernel: GEOM_JOURNAL[1]: Msync time of /scr09: 0.000010s
Aug 31 13:56:46 donkey kernel: GEOM_JOURNAL[1]: Sync time of /scr09: 0.000009s
Aug 31 13:56:46 donkey kernel: GEOM_JOURNAL[1]: Suspend time of /scr09: 0.000007s
Aug 31 13:56:46 donkey kernel: GEOM_JOURNAL[1]: Starting copy of journal.
Aug 31 13:56:46 donkey kernel: GEOM_JOURNAL[1]: Switch time of da2: 0.002364s
Aug 31 13:56:46 donkey kernel: GEOM_JOURNAL[1]: Data has been copied.
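In case it helps anyone pick the slow phases out of a longer log, here is
a minimal Python sketch that flags GEOM_JOURNAL timing lines above a
threshold. The function name slow_phases, the 1-second threshold, and the
regular expression are my own assumptions for illustration; they are based
only on the line format shown in this log (one entry per line, not
wrapped), not on anything gjournal itself provides.

```python
import re

# Sample lines copied from the server log in this post (unwrapped).
LOG = """\
Aug 31 13:55:24 donkey kernel: GEOM_JOURNAL[1]: Entire switch time: 14.030198s
Aug 31 13:55:33 donkey kernel: GEOM_JOURNAL[1]: Entire switch time: 0.000013s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Suspend time of /scr09: 0.000007s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Suspend time of /scr10: 10.109732s
Aug 31 13:56:04 donkey kernel: GEOM_JOURNAL[1]: Entire switch time: 10.182759s
"""

# Assumed line shape: "<Mon DD HH:MM:SS> <host> kernel: GEOM_JOURNAL[n]:
# <phase> time[ of <target>]: <seconds>s"
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"GEOM_JOURNAL\[\d+\]: (?P<phase>.+ time(?: of \S+)?): (?P<secs>\d+\.\d+)s$"
)


def slow_phases(text, threshold=1.0):
    """Return (timestamp, phase, seconds) for phases slower than threshold."""
    hits = []
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m and float(m.group("secs")) > threshold:
            hits.append((m.group("stamp"), m.group("phase"), float(m.group("secs"))))
    return hits


for stamp, phase, secs in slow_phases(LOG):
    print(f"{stamp}  {phase}: {secs:.3f}s")
```

On the sample lines this reports only the two "Entire switch time" entries
and the /scr10 suspend, which are the ones that stand out by eye as well.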
And from the syslog server (client messages):
Aug 31 13:55:23 <user.notice> bowltest4 kernel: nfs: server donkey not responding, still trying
Aug 31 13:55:23 <user.notice> bowltest4 kernel: nfs: server donkey OK
Aug 31 13:55:23 <user.notice> laybox32 kernel: nfs: server donkey OK
Aug 31 13:55:29 <user.notice> b-115-4 kernel: nfs: server donkey not responding, still trying
Aug 31 13:55:29 <user.notice> b-115-4 kernel: nfs: server donkey OK
Aug 31 13:55:56 <user.notice> b-116-16 kernel: nfs: server donkey not responding, still trying
Aug 31 13:55:56 <user.notice> b-204-40 kernel: nfs: server donkey not responding, still trying
Aug 31 13:55:57 <user.notice> b-116-16 kernel: nfs: server donkey OK
Aug 31 13:55:57 <user.notice> lic2 kernel: nfs: server donkey not responding, still trying
Aug 31 13:55:57 <user.notice> b-204-40 kernel: nfs: server donkey OK
Aug 31 13:55:57 <user.notice> lic2 kernel: nfs: server donkey OK
Aug 31 13:55:57 <user.notice> laybox29 kernel: nfs: server donkey not responding, still trying
Aug 31 13:55:57 <user.notice> laybox26 kernel: nfs: server donkey not responding, still trying
Aug 31 13:55:58 <user.notice> laybox19 kernel: nfs: server donkey not responding, still trying
Aug 31 13:55:58 <user.notice> laybox37 kernel: nfs: server donkey not responding, still trying
Aug 31 13:56:00 <user.notice> laybox19 kernel: nfs: server donkey OK
Aug 31 13:56:00 <user.notice> laybox26 kernel: nfs: server donkey OK
Aug 31 13:56:00 <user.notice> laybox37 kernel: nfs: server donkey OK
Aug 31 13:56:00 <user.notice> laybox29 kernel: nfs: server donkey OK
Aug 31 13:56:05 <daemon.info> ws-119-8 amd[2640]: file server donkey20.centtech.com, type nfs, state not responding
Aug 31 13:56:05 <daemon.info> ws-119-8 amd[2640]: file server donkey20.centtech.com, type nfs, state ok
Aug 31 13:56:36 <user.notice> b-116-17 kernel: nfs: server donkey not responding, still trying
Aug 31 13:56:36 <user.notice> b-116-17 kernel: nfs: server donkey OK
Aug 31 13:56:40 <user.notice> b-210-17 kernel: nfs: server donkey not responding, still trying
Aug 31 13:56:41 <user.notice> b-204-41 kernel: nfs: server donkey not responding, still trying
Aug 31 13:56:41 <user.notice> laybox17 kernel: nfs: server donkey not responding, still trying
Aug 31 13:56:44 <user.notice> b-204-38 kernel: nfs: server donkey not responding, still trying
Aug 31 13:56:44 <user.notice> b-204-38 kernel: nfs: server donkey OK
Aug 31 13:56:44 <user.notice> bowltest3 kernel: nfs: server donkey not responding, still trying
Aug 31 13:56:46 <user.notice> b-210-17 kernel: nfs: server donkey OK
Aug 31 13:56:46 <user.notice> laybox17 kernel: nfs: server donkey OK
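To make the correlation a bit more concrete, here is a rough Python sketch
that counts how many client "not responding" messages fall inside each long
journal switch. It rests on two assumptions of mine that the logs do not
confirm: that the "Entire switch time" line is logged when the switch ends
(so the stall runs from timestamp minus duration up to the timestamp), and
that the clocks on donkey and the syslog server agree. The year is also
assumed, since syslog lines carry none, and the helper names are made up
for illustration.

```python
from datetime import datetime, timedelta

YEAR = 2006  # assumption: syslog timestamps do not include a year


def parse_stamp(stamp):
    """Parse a syslog-style 'Mon DD HH:MM:SS' timestamp."""
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")


# (log timestamp, "Entire switch time" in seconds) from the server log.
SWITCHES = [("Aug 31 13:55:24", 14.030198), ("Aug 31 13:56:04", 10.182759)]

# Timestamps of the kernel "not responding" messages from the syslog server.
NFS_TIMEOUTS = [
    "Aug 31 13:55:23", "Aug 31 13:55:29", "Aug 31 13:55:56",
    "Aug 31 13:55:56", "Aug 31 13:55:57", "Aug 31 13:55:57",
    "Aug 31 13:55:57", "Aug 31 13:55:58", "Aug 31 13:55:58",
    "Aug 31 13:56:36", "Aug 31 13:56:40", "Aug 31 13:56:41",
    "Aug 31 13:56:41", "Aug 31 13:56:44", "Aug 31 13:56:44",
]


def timeouts_in_window(end_stamp, secs, timeouts):
    """Return the timeouts inside [end - secs, end] (assumed stall window)."""
    end = parse_stamp(end_stamp)
    start = end - timedelta(seconds=secs)
    return [t for t in timeouts if start <= parse_stamp(t) <= end]


for end_stamp, secs in SWITCHES:
    hits = timeouts_in_window(end_stamp, secs, NFS_TIMEOUTS)
    print(f"switch ending {end_stamp} ({secs:.1f}s): {len(hits)} client timeouts")
```

The last group of timeouts (13:56:36 to 13:56:44) is not matched by either
window, because the server log excerpt stops before the next /scr10 switch
would have been reported.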
Are the journal devices not large enough? Is there a formula for sizing
them? Also, can I umount the data device, remove journaling, and mount it
as a regular device? What are the steps for that? Thanks, and sorry for
the long-winded posting.
------------------------------
Kevin Kramer
Sr. Systems Administrator
512.418.5725
Centaur Technology, Inc.
www.centtech.com