Paul B. Henson
2010-Oct-20 00:20 UTC
[zfs-discuss] live upgrade with lots of zfs filesystems -- still broken
A bit over a year ago I posted about a problem I was having with live upgrade on a system with lots of file systems mounted:

http://opensolaris.org/jive/thread.jspa?messageID=411137

An official Sun support call was basically just closed with no resolution. I was quite fortunate that Jens Elkner had made a workaround available which made live upgrade actually usable for my deployment (thanks again, Jens!). I would have been pretty screwed without it.

While still not exactly speedy, with the workaround in place live upgrade was fairly usable, and we've been using it for installing patches and upgrading to update releases with no problems.

Until now; unfortunately, after installing the latest live upgrade patches on my existing U8 system in preparation for upgrading to U9, live upgrade has become even less usable than when I initially tried it without the workaround in place. While creating a new BE was still reasonably quick, mounting it took over *six* hours to complete 8-/.

Whereas before most of the time went to mounting/unmounting all the filesystems (resolved by Jens' patch), now the majority of the six hours was spent spinning in /etc/lib/lu/plugins/lupi_bebasic. I don't know exactly what it was doing (as the source code to live upgrade does not appear to be available), but for most of the six hours it seems it was comparing strings:

# pstack 1670
1670:   /etc/lib/lu/plugins/lupi_bebasic plugin
 fee05973 strcmp   (8046474, 8046478) + 1c3
 fef6ae45 lu_smlGetTagByName (806920c, 16ef, fefa0f30) + 74
 fef71717 lu_tsfSearchFields (806920c, 0, 3, 2, 1, 88369f4) + 13f
 fef4e2da lu_beoGetFstblFilterSwapAndShared (80513bc, 8046978, 8069234, 806920c) + 1be
 fef4f1f7 lu_beoGetFstblToMountBe (80513bc, 80541e4, 80469c4, 80513fc) + 247
 fef515cf lu_beoMountBeByBeName (80513bc, 8046a24, 805419c, 80513fc, 0, 0) + 39c
 0804ba6c ???????? (804ef6c, 1, 8068dd4, 0, 8069ee4, 8069ee4)
 fef5fa2b ???????? (804ef6c, 8046f3c, 8069ee4, 8046ae8)
 fef5f5c3 ???????? (804ef6c, 8046f3c, 8069ee4, 8046ae8)
 fef5f397 ???????? (804ef6c, 8046f3c, 8069ee4)
 fef5f1c5 ???????? (804ef6c, 8046f3c, 8069ee4)
 fef603c6 ???????? (804ef6c)
 fef5ec12 lu_pluginProcessLoop (804ef6c) + 42
 0804a028 main     (2, 8046fa8, 8046fb4) + 2d3
 08049cba ???????? (2, 80471d8, 80471f9, 0, 8069954, 8069914)

Six hours, fully utilizing a CPU core, comparing strings 8-/.

I considered opening a support ticket, but given the lack of response previously, I decided to poke around with it a bit myself first. Running truss on lumount revealed that getmntent was being called to enumerate mount points, so initially I tried preloading a shared library to interpose the getmntent call and skip all the mount points corresponding to my data file systems under /export. That didn't make any difference.

I then moved on to look at the multiple calls to the zfs binary made by lumount, which seemed like potential sources of extraneous data that could cause unnecessary processing. Replacing /sbin/zfs with a wrapper script yielded quite unexpected results, as it seems there are many links to the zfs binary, which does different magic depending on the value of argv[0] 8-/. The path to the zfs binary is statically defined in /etc/lib/lu/liblu.so.1, so in a display of horrid kludginess ;), I edited the binary file and replaced all instances of /sbin/zfs with /sbin/zfb, and created /sbin/zfb with the content:

-----------
#! /bin/sh

# Pull in the live upgrade defaults and shell library (provides lulib_get_fs2ignore).
. /etc/default/lu
LUBIN=${LUBIN:=/usr/lib/lu}
. $LUBIN/lulib

if [ "$1" = "list" ] ; then
    # Drop the file systems that the patched live upgrade already ignores.
    /sbin/zfs "$@" | /usr/bin/egrep -v -f `lulib_get_fs2ignore`
else
    # Everything else goes straight through to the real binary.
    exec /sbin/zfs "$@"
fi
----------

This utilizes the configuration included in Jens' patch to ignore the exact same set of file systems ignored by the rest of live upgrade with the patch installed. With this kludge in place lumount took *23 seconds*, roughly three orders of magnitude less time.

I tend to tilt at windmills, so I will probably end up opening another support ticket.
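
For anyone who wants to see what the "list" branch of the wrapper does without touching a real system, here is a minimal sketch that simulates it. Everything in it is made up for illustration: the dataset names, the fake_zfs_list and filter_list helper names, and the single ^export/ pattern. It also assumes that lulib_get_fs2ignore prints the path of a file holding one extended regular expression per line, which is what its use with egrep -f suggests but which I have not confirmed against the lulib source.

```shell
#!/bin/sh
# Sketch only: simulates the "list" branch of the wrapper without calling zfs.
# Dataset names, helper names and the pattern are hypothetical.

# Stand-in for `zfs list` output.
fake_zfs_list() {
    cat <<'EOF'
NAME                   USED  AVAIL  REFER  MOUNTPOINT
rpool                 10.1G  56.8G    34K  /rpool
rpool/ROOT/s10u8      6.20G  56.8G  6.20G  /
export/home/alice      120M  1.00T   120M  /export/home/alice
export/home/bob         98M  1.00T    98M  /export/home/bob
EOF
}

# Stand-in for the pattern file that lulib_get_fs2ignore is assumed to name.
ignore_file=$(mktemp)
trap 'rm -f "$ignore_file"' EXIT
echo '^export/' > "$ignore_file"

# Same shape as the wrapper's pipeline: print every line that matches
# none of the patterns in the file. (The wrapper itself uses /usr/bin/egrep;
# grep -E is the equivalent on current systems.)
filter_list() {
    fake_zfs_list | grep -E -v -f "$ignore_file"
}

filter_list
```

Run as is, this prints the header line and the two rpool datasets and drops both export/home entries, which is the reduction in input that lumount sees with the wrapper in place.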
The last time there seemed to be no interest in fixing live upgrade so it would actually scale :(; maybe this time I'll have better luck.

For those Oracle employees in the audience, if anyone could possibly explain exactly what processing lupi_bebasic is doing that results in six hours of string comparisons, I'm dying of curiosity :). And if anyone wants to jump up and champion the cause of getting live upgrade to work in an environment with many file systems, I'd be happy to help; it would be nice to have shipped code that works without breaking out the hex editor ;).

Thanks...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768