George Hartzell
2008-May-13 16:47 UTC
good/best practices for gmirror and gjournal on a pair of disks?
I've been running many of my systems for some time now using gmirror on a pair of identical disks, as described by Ralf at:

  http://people.freebsd.org/~rse/mirror/

Each disk has a single slice that covers almost all of the disk. These slices are combined into the gmirror device (gm0), which is then carved up by bsdlabel into gm0a (/), gm0b (swap), gm0d (/var), gm0e (/tmp), and gm0f (/usr).

My latest machine is using Seagate 1TB disks, so I thought I should add gjournal to the mix to avoid ugly fscks if/when the machine doesn't shut down cleanly. I ended up just creating a gm0f.journal and using it for /usr, which basically seems to be working.

I'm left with a couple of questions though:

- I've read in the gjournal man page that when it is "... configured on top of gmirror(8) or graid3(8) providers, it also keeps them in a consistent state..." I've been trying to figure out whether this simply falls out of how gjournal works or whether there's explicit collusion with gmirror/graid3, but I can't come up with a satisfactory explanation. Can someone walk me through it?

  Since I'm only gjournal'ing a portion of the underlying gmirror device, I assume that I don't get this benefit?

- I've also read in the gjournal man page "... that sync(2) and fsync(2) system calls do not work as expected anymore." Does this invalidate any of the assumptions made by various database packages such as postgresql, sqlite, berkeley db, etc. about if/when/whether their data is safely on the disk?

- What's the cleanest gjournal adaptation of rse's two-disk-mirror-everything setup that would be able to avoid tedious gmirror syncs? The best I've come up with is to do two slices per disk, combine the slices into a pair of gmirror devices, bsdlabel the first into gm0a (/), gm0b (swap), gm0d (/var) and gm0e (/tmp), and bsdlabel the second into a gm1f which gets a gjournal device.

  Alternatively, would it work and/or make sense to give each disk a single slice, combine them into a gmirror, put a gjournal on top of that, then use bsdlabel to slice it up into partitions?

Is anyone using gjournal and gmirror for all of the system on a pair of disks in some other configuration?

Thanks,

g.
Adam McDougall
2008-May-13 20:36 UTC
good/best practices for gmirror and gjournal on a pair of disks?
I am pasting below the instructions I would use to convert a recently installed system with only / (root) and swap to use gmirror+gjournal. They are in MediaWiki markup so they can be pasted into a wiki if desired. I based my gmirror steps on the instructions from http://people.freebsd.org/~rse/mirror/ so that's why some of the words sound familiar. I also have similar instructions for setting up a gmirrored da0s1a and da0s1b alongside a zfs mirror containing the rest.

I decided to journal /usr, /var and /tmp and leave / as a standard UFS partition: it is so small that fsck doesn't take long anyway, and hopefully it doesn't get written to enough for an abrupt reboot to cause damage. Because I'm not journaling the root partition, I chose to ignore the possibility of gjournal marking the mirror clean. Sudden reboots don't happen often enough on servers for me to care. And all my servers got abruptly rebooted this Sunday and they all came up fine :)

I believe gjournal uses 1G for the journal by default (2x512), which seemed to be sufficient on all of the systems where I have used the default, but I quickly found that using a smaller journal is a bad idea and leads to panics that I was unable to avoid with tuning.
Considering how close the 1G default was to being too small, I chose to go several times above the default journal size (disk is cheap and I want to be sure), but I ran into problems with gjournal label -s (size) rejecting my sizes or wrapping the value around to something too low. As a workaround I chose to use a separate partition for each journal. I quickly ran out of partitions in a bsd disklabel, so I decided to partition each disk into two slices: the first for data and the second for journals. This also made it easier to line up disk devices so they made more sense as a pair, for example: gm0s1d (data) + gm0s2d (journal) = /usr.

I will note that if you accidentally put a gjournal label in the 'wrong' spot on your disk, you might make it tough for yourself to get rid of it. Several times I applied a gjournal label and discovered something less than ideal about it, but every time I did 'gjournal stop foo' the label was automatically detected again as a child of a different part of the disk, because it could still be seen there, and I could not unload it. That is part of why I use -h for gjournal label, use slices+partitions, and start the first partition at offset 16, some of which may have been for gmirror's sake too.

==Software raid on 72G disks with gjournal==
5 min to set up, around 30 min to sync

===Prepare===
*Clear any old mirror config including old gmirror labels
 sysctl kern.geom.debugflags=16
 gmirror clear da0
 gmirror clear da1
 sysctl kern.geom.debugflags=0
 dd if=/dev/zero of=/dev/da1 bs=512 count=79
*Place a GEOM mirror label onto second disk
 gmirror label -v -n -b round-robin gm0 /dev/da1
*Activate GEOM mirror kernel layer
 gmirror load

===Partition===
*Place a PC MBR onto the second disk to make it bootable. Also partition it with the majority of space as partition 1, and enough for your journal partitions as partition 2. '''You might get an error, such as "fdisk: Geom not found". If the next steps work, ignore the error.'''
 fdisk -v -B -I /dev/mirror/gm0
*Partition it into two slices. I think there is an easier way but I cannot remember how. Maybe I used a different method of using fdisk and ignored the end cyl values since they don't seem to make much sense anyway. sysinstall or sade could be used as an alternative.
 fdisk -i /dev/mirror/gm0
 Do you want to change our idea of what BIOS thinks ? '''[n]'''
 The data for partition 1 is:
 sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
     start 63, size 143363997 (70001 Meg), flag 80 (active)
 ''^^^^^^^^^ A = 143363997''
         beg: cyl 0/ head 1/ sector 1;
         end: cyl 731/ head 254/ sector 63
 Do you want to change it? [n] '''y'''<br>
 ''We want to make partitions approx 60G(data) and 10G(journals).''
 ''So take variable A, divide by 7 and multiply by 6 to get var B.''
 ''B = 122883426''<br>
 Supply a decimal value for "sysid (165=FreeBSD)" '''[165]'''
 Supply a decimal value for "start" '''[63]'''
 Supply a decimal value for "size" [143363997] '''122883426'''
 ''^^^^^^^^^''
 ''put B here''
 fdisk: WARNING: partition does not end on a cylinder boundary
 fdisk: WARNING: this may confuse the BIOS or some operating systems
 Correct this automatically? [n] '''y'''
 fdisk: WARNING: adjusting size of partition to 122881122
 Explicitly specify beg/end address ? '''[n]'''
 sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
     start 63, size 122881122 (60000 Meg), flag 80 (active)
 ''^^^^^^^^^''
 ''C = 122881122''
 ''D = C + 63 = 122881122 + 63 = 122881185''
 ''E = A - D = 143363997 - 122881185 = 20482812''<br>
         beg: cyl 0/ head 1/ sector 1;
         end: cyl 480/ head 254/ sector 63
 Are we happy with this entry? [n] '''y'''
 The data for partition 2 is:
 <UNUSED>
 Do you want to change it? [n] '''y'''
 Supply a decimal value for "sysid (165=FreeBSD)" [0] '''165'''
 Supply a decimal value for "start" [0] '''122881185'''
 ''^^^^^^^^^''
 ''put D here''
 Supply a decimal value for "size" [0] '''20482812'''
 ''^^^^^^^^''
 ''put E here''
 Explicitly specify beg/end address ? '''[n]'''
 Are we happy with this entry? [n] '''y'''
 The data for partition 3 is:
 <UNUSED>
 Do you want to change it? '''[n]'''
 The data for partition 4 is:
 <UNUSED>
 Do you want to change it? '''[n]'''
 Partition 1 is marked active
 Do you want to change the active partition? '''[n]'''
 Should we write new partition table? [n] '''y'''
'''You might get an error, such as "fdisk: Geom not found". If the next steps work, ignore the error.'''

===Disklabel===
*Place a BSD disklabel onto the mirrors
 bsdlabel -w -B /dev/mirror/gm0s1
 bsdlabel -w /dev/mirror/gm0s2
NOTICE: figure out what partitions you want by referring to
 bsdlabel /dev/da0s1
and/or running
 bsdlabel /dev/mirror/gm0s1
on a different server that has already been mirrored, and partition to your liking. Size can be specified with ##M, ##G or * for remainder, and offset should be * to make it calculate it. Paste the output into the editor and make whatever changes you want, as long as the "a" partition starts at offset 16 and the "c" partition at offset 0.
*Partition 1:
 bsdlabel -e /dev/mirror/gm0s1
Example:
 #    size  offset  fstype  [fsize bsize bps/cpg]
 a:   1G    16      4.2BSD
 b:   4G    *       swap
 c:   *     0       unused  # "raw" part, don't edit
 d:   10G   *       4.2BSD
 e:   *     *       4.2BSD
 f:   4G    *       4.2BSD
*Partition 2:
 bsdlabel -e /dev/mirror/gm0s2
Example:
 #    size  offset  fstype  [fsize bsize bps/cpg]
 c:   *     0       unused  # "raw" part, don't edit
 d:   4G    16      4.2BSD
 e:   4G    *       4.2BSD
 f:   *     *       4.2BSD

===Gjournal label===
*Label the data and journals so the journaled partition is available.
 gjournal label -f -h mirror/gm0s1d mirror/gm0s2d
 gjournal label -f -h mirror/gm0s1e mirror/gm0s2e
 gjournal label -f -h mirror/gm0s1f mirror/gm0s2f
*Load the kernel module so the journaled partitions are detected:
 gjournal load

===Newfs===
*Format the devices with journaling support in UFS:
 newfs /dev/mirror/gm0s1a
 newfs -J /dev/mirror/gm0s1d.journal
 newfs -J /dev/mirror/gm0s1e.journal
 newfs -J /dev/mirror/gm0s1f.journal

===Mount===
*Mount them temporarily:
 mount /dev/mirror/gm0s1a /mnt
 mkdir -p /mnt/usr /mnt/var /mnt/tmp
 mount -o async /dev/mirror/gm0s1d.journal /mnt/usr
 mount -o async /dev/mirror/gm0s1e.journal /mnt/var
 mount -o async /dev/mirror/gm0s1f.journal /mnt/tmp

===Copy Data===
*Install rsync, if not already:
 pkg_add -r rsync
*Copy the original boot drive to the new device:
 rehash
 rsync -avHSx --progress / /mnt/
(This will take about 1 minute.)

===Prepare mirror for booting===
*Edit '''/mnt/etc/fstab''' replacing the following mountpoints:
 vi /mnt/etc/fstab
Old:
 # Device        Mountpoint  FStype  Options    Dump  Pass#
 /dev/da0s1b    none        swap    sw         0     0
 /dev/da0s1a    /           ufs     rw         1     1
 /dev/cd0       /cdrom      cd9660  ro,noauto  0     0
 /dev/acd0      /cdrom1     cd9660  ro,noauto  0     0
New:
 # Device                     Mountpoint  FStype  Options    Dump  Pass#
 /dev/mirror/gm0s1b          none        swap    sw         0     0
 /dev/mirror/gm0s1a          /           ufs     rw         1     1
 /dev/mirror/gm0s1d.journal  /usr        ufs     rw,async   2     2
 /dev/mirror/gm0s1e.journal  /var        ufs     rw,async   2     2
 /dev/mirror/gm0s1f.journal  /tmp        ufs     rw,async   2     2
 /dev/cd0                    /cdrom      cd9660  ro,noauto  0     0
 /dev/acd0                   /cdrom1     cd9660  ro,noauto  0     0
*Load necessary kernel modules at boot:
 echo 'geom_journal_load="YES"' >> /mnt/boot/loader.conf
 echo 'geom_mirror_load="YES"' >> /mnt/boot/loader.conf
*Instruct boot stage 2 loader on first disk to boot with the boot stage 3 loader from the second disk (mainly because BIOS might not allow easy booting from second ATA disk or at least requires manual intervention on the console)
 echo "1:da(1,a)/boot/loader" >/boot.config
*We're done with the first stage, reboot:
 reboot

===Check results===
*Login and run df. Should look like this:
 Filesystem                  1K-blocks  Used    Avail     Capacity  Mounted on
 /dev/mirror/gm0s1a          1012974    201898  730040    22%       /
 devfs                       1          1       0         100%      /dev
 /dev/mirror/gm0s1d.journal  10154156   144920  9196904   2%        /usr
 /dev/mirror/gm0s1e.journal  40209204   322     36992146  0%        /var
 /dev/mirror/gm0s1f.journal  4058060    12      3733404   0%        /tmp

===Configure second disk into mirror===
*Add the original boot disk to the mirror. Make sure the first disk is treated as a really fresh one
 dd if=/dev/zero of=/dev/da0 bs=512 count=79
*Switch GEOM mirror to auto-synchronization and add first disk (first disk is now immediately synchronized with the second disk content)
 gmirror configure -a gm0
 gmirror insert gm0 /dev/da0
*Wait for the GEOM mirror synchronization to complete, or check it manually with ''gmirror list''
 sh -c 'while [ ".`gmirror list | grep SYNCHRONIZING`" != . ]; do sleep 1; done'
*Reboot into the final two-disk GEOM mirror setup (now actually boots with the MBR and boot stages on first disk as it was synchronized from second disk)
 reboot

===Mirror check script===
*Enable daily_status_gmirror_enable in /etc/periodic.conf or write your own script to monitor gmirror status
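
For the "write your own script" option in the mirror check step, a minimal sketch might look like the following. It is not part of the original instructions. It assumes that `gmirror status` prints one line per mirror with the mirror name (mirror/...) in the first column and its state (for example COMPLETE or DEGRADED) in the second; the function name check_gmirror_status is made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumed shape of `gmirror status` output:
#       Name    Status  Components
# mirror/gm0  COMPLETE  da0
#                       da1

# Hypothetical helper: reads `gmirror status` text on stdin, prints
# "name state" for every mirror whose state is not COMPLETE, and
# returns 1 if at least one such mirror was seen (0 otherwise).
check_gmirror_status() {
    awk '
        $1 ~ /^mirror\// && $2 != "COMPLETE" { print $1, $2; bad = 1 }
        END { exit bad + 0 }
    '
}

# Possible use from cron on the live system:
#   gmirror status | check_gmirror_status || echo "gmirror needs attention" >&2
```

Reading the status text from stdin keeps the check independent of the machine it runs on, so it can be tried against saved output before wiring it into cron.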
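
The slice arithmetic from the Partition step (values A through E in the fdisk dialogue) can be reproduced with plain sh arithmetic, which may help when adapting the sizes to a different disk. This is a sketch under two assumptions: the geometry is 255 heads x 63 sectors per track (as the "head 254/ sector 63" end values in the dialogue suggest), and fdisk's automatic correction trims the slice so that it ends on a cylinder boundary. The function name slice_sizes is made up. E is computed as A - D, which is what the figures in the dialogue correspond to.

```shell
#!/bin/sh
# Sketch: recompute the fdisk values B, C, D and E from A.
# Assumptions: 255 heads x 63 sectors/track, first slice starts at sector 63.

# slice_sizes A  -> prints "B C D E"
slice_sizes() {
    A=$1
    START=63                 # first slice starts after the first track
    CYL=$((255 * 63))        # sectors per cylinder (16065)
    B=$((A / 7 * 6))         # roughly 6/7 of the space for the data slice
    # trim so that START + C falls on a cylinder boundary
    C=$(( (START + B) / CYL * CYL - START ))
    D=$((C + START))         # start sector of the second (journal) slice
    E=$((A - D))             # size of the second slice
    echo "$B $C $D $E"
}

slice_sizes 143363997        # prints: 122883426 122881122 122881185 20482812
```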