Hello all, I'd like some practical advice on migration of a Sun Fire X4500 (Thumper) from aging data disks to a set of newer disks. Some questions below are my own, others are passed from the customer and I may consider not all of them sane - but must ask anyway ;)

1) They hope to use 3Tb disks, and hotplugged an Ultrastar 3Tb for testing. However, the system only sees it as a 802Gb device, via Solaris format/fdisk as well as via parted [1]. Is that a limitation of the Marvell controller, the disk, or the current OS (snv_117)? Would it be cleared by a reboot and proper disk detection on POST (I'll test tonight), or will these big disks not work in an X4500, period?

[1] http://code.google.com/p/solaris-parted/downloads/detail?name=solaris-parted-0.2.tar.gz&can=2&q

Gotta run now, will ask more in the evening :)

Thanks for now,
//Jim
Casper.Dik at oracle.com
2012-May-15 15:17 UTC
[zfs-discuss] Migration of a Thumper to bigger HDDs
>Hello all, I'd like some practical advice on migration of a
>Sun Fire X4500 (Thumper) from aging data disks to a set of
>newer disks. Some questions below are my own, others are
>passed from the customer and I may consider not all of them
>sane - but must ask anyway ;)
>
>1) They hope to use 3Tb disks, and hotplugged an Ultrastar 3Tb
>   for testing. However, the system only sees it as a 802Gb
>   device, via Solaris format/fdisk as well as via parted [1].
>   Is that a limitation of the Marvell controller, disk,
>   the current OS (snv_117)? Would it be cleared by a reboot
>   and proper disk detection on POST (I'll test tonight) or
>   these big disks won't work in X4500, period?

Your old release of Solaris (nearly three years old) doesn't support disks over 2TB, I would think.

(A 3TB disk is about 3E12 bytes, the 2TB limit is 2^41 bytes, and the difference is around 800Gb.)

Casper
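As a sanity check of that arithmetic (assuming the usual 5,860,533,168-sector nominal capacity of this class of 3Tb drive, and that the visible size simply wraps at the 2^41-byte mark):

# echo '5860533168 * 512 - 2^41' | bc
801569726464

That is right about the 802Gb that format reported.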
Urgent interrupt processed, I got back to my questions :)

Thanks Casper for his suggestion; the box is scheduled to reboot soon and I'll try a newer Solaris (oi_151a3 probably) as well.

UPDATE: Yes, oi_151a3 has seen all "2.73Tb" of the disk, so my old question is resolved: the original Thumper (Sun Fire X4500) does see the 3Tb disks, at least with a current OS; hardware limitations seem to be absent. The disk is recognized as "ATA-Hitachi HUA72303-A580-2.73Tb". Booted back into snv_117, the box again sees the smaller disk size - so it is an OS thing indeed. OS migration into upgrade plans, check! ;}

2012-05-15 13:41, Jim Klimov wrote:
> Hello all, I'd like some practical advice on migration of a
> Sun Fire X4500 (Thumper) from aging data disks to a set of
> newer disks. Some questions below are my own, others are
> passed from the customer and I may consider not all of them
> sane - but must ask anyway ;)
>
> 1) They hope to use 3Tb disks, and hotplugged an Ultrastar 3Tb
>    for testing. However, the system only sees it as a 802Gb
>    device, via Solaris format/fdisk as well as via parted [1].
>    Is that a limitation of the Marvell controller, disk,
>    the current OS (snv_117)? Would it be cleared by a reboot
>    and proper disk detection on POST (I'll test tonight) or
>    these big disks won't work in X4500, period?
>
> [1]
> http://code.google.com/p/solaris-parted/downloads/detail?name=solaris-parted-0.2.tar.gz&can=2&q

The Thumper box has 48 250Gb disks, beginning to die off, now arranged as two zfs pools - an rpool built over the two bootable drives, and the data pool built as a 45-drive array of 9*(4+1) raidz1 sets striped, plus one hotspare. AFAIK the number of raidz vdevs can not be brought down without compromising data integrity/protection, and this is the only server around with ~9Tb of storage capacity - so there are no backups, nor anywhere to temporarily and safely migrate the data to. Budget is tight.

We are estimating assorted options and would like suggestions - perhaps some of the list users have passed through similar transitions, and/or know which options to avoid like fire ;)

We know that large redundancy is highly recommended for big HDDs, so in-place autoexpansion of the raidz1 pool onto 3Tb disks is out of the question. So far the plan is to migrate the current pool onto 3Tb drives, and it seems that with the recommended 3-disk redundancy for large drives, a raidz3 of 8+3 disks plus one hotspare would fit nicely onto the 6 controllers (2 disks each). Mirroring of 1+2 or 1+3 disks times 5 (the minimum desired new volume) would fill most of the box and cost a lot for relatively little volume (reading would be fast, though).

What would the experienced people suggest - would raidz3 be good? Would SSDs help? I'm primarily thinking of L2ARC, though there is NFS serving and iSCSI serving that might benefit from ZILs as well. What SSD sizing and models would people suggest for the 16GB RAM server? AFAIK it might be possible to bring the RAM up to 32GB (maybe costly), but sadly no more can be installed according to the docs and the availability of compatible memory modules; should the RAM doubling be pursued?

I know it is hard to give suggestions about something vague; the storage profile is "a bit of everything" in a software development company - homedirs, regular rolling backups, images of produced software, VM images for test systems (executed on remote VM hosts, using the Thumper's storage via ZFS/NFS and ZFS/iSCSI), some databases "of practically unlimited capacity" for the testbed systems.
Fragmentation is rather high: resilver of one disk took 15 hours; weekly scrubs take about 85 hours. The server uses a 1Gbit LAN connection (it might become a 4Gbit link via aggregation, but the server has not produced bursts of disk traffic big enough, even locally, to saturate the one uplink).

Now on to the migration options we brainstormed...

IDEA 1

By far the safest-looking option: rent or buy a 12-disk eSATA enclosure and a PCI-X adapter (model suggestions welcome - it should support 3TB disks), configure the new pool in the enclosure, zfs send | zfs recv the data, restart the local zones with their tasks (databases) and the nfs/iscsi services from the new pool. Ultimately take out the disks of the old pool, plug the disks of the new pool (and SSDs) inside the Thumper, live happy and fast :) This option requires an enclosure and adapter, with no clue what to choose and how much that would cost on top of the raw disk price.

IDEA 2

Split the original data into several pools, migrating onto mirrors that start out as one big disk each. This idea proposes that the one hotspare disk bay becomes populated by one new big disk at a time (the first one is already inside), and a pool is created on top of this one disk. Up to 3Tb of data is sent to the new pool, then a new disk and pool are inserted/created/sent. The original pool remains intact while the new pools are upgraded to N-way mirrors, and if some sectors do become corrupt - the data can be restored with some manual fuss about plugging the old pool back in. This allows us to enforce tiering of information (i.e. pour stale WORM data onto some pools, and dynamic data that tends to fragment - onto others); however, free space would become individual to each such pool, while the cost overhead of mirrors may be considered prohibitive.

IDEA 3

This idea involves possible unsafety to the data at some moments, but it allows migrating the existing datasets onto a complete new raidz3 8+3 pool, and with little downtime. The idea is this: allocate 9 250Gb partitions on the new big drive, and resilver one disk from each raidz1 set onto a new partition. This way all raidz1 sets remain redundant, but the big disk becomes a SPOF: if anything happens to it during data transition, all of the pool's raidz sets go non-redundant. However, this frees 9 HDD bays where we can stick 9 new big disks and set up the 8+3 array with 2 missing devices, so the new pool can still survive one disk breaking down during migration (see the command sketch after this message). After the old pool's data has been synced to the new pool, and if no two drives break during this time, the 250Gb disks can be taken out and the 8+3 set gets its remaining two disks and the hotspare (the disk with the copies of the original 9 partitions should remain untouched until the end).

IDEA 4

Similar to IDEA 3, except it has less risk to data at the expense of server uptime: the 9 partitions are DD'ed to the new big disk, and the pool is mounted read-only using these vdevs (hey, if it is possible to stick in restored images via lofi - then using partitions instead of the original drives should be possible, right?). DDing works a lot faster: I estimate 15 hours for all 9 disks instead of 15 hours to resilver one. Then the readonly pool is zfs-sent to the new 8+1(+2missing) raidz3 pool as above. If anything bad happens during this migration, the original disks were not modified (unlike the replacement of disks with partitions as in IDEA 3) and can be easily reinstated. The main problem is that services would be down for at least a week, although this can be remedied by migrating them off the box for a while.
It is also questionable whether DD'ed images of the pool disks would be picked up by zfs from many partitions on one disk.

IDEA 5

Like ideas 3 or 4, but involving SVM mirrors as the block vdevs for ZFS, instead of resilvering or DDing. These SVM mirrors would contain the current 9 disks on one side, and a partition on the new disk on the other. Since the SVM metadevice remains defined, the backend storage can be juggled freely.

------

So, a few wild options have been discussed, some risky to data, some risky to the uptime of a critical server, some rather costly - or so it seems. I ask the community to please take them seriously and not let my friends make some predictable fatal mistake ;) Are any of these options (other than IDEA 1) viable and/or reasonable (i.e. would you do something similar, ever)?

PS: Again, suggestions on L2ARC and ZIL are welcome for a 16GB RAM server with big addressable storage and a relatively small working set (perhaps a hundred Gb are regularly needed more than once). Or, do "16Gb RAM" and "~24Tb disks" not come together in one sentence?

PPS: How much can a used X4540 be bargained for, as an orthogonal solution, and how much RAM can be put into it?

//Jim Klimov
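For what it's worth, a minimal sketch of the "8+3 with two missing devices" trick from IDEA 3 plus the send/recv step from IDEA 1 - device names are hypothetical and the sparse files merely stand in for the not-yet-available disks; an untested illustration, not a recipe:

# mkfile -n 2794g /var/tmp/missing1    # sparse placeholders, sized no larger
# mkfile -n 2794g /var/tmp/missing2    # than the real 3Tb disks
# zpool create -f bigpool raidz3 c0t1d0 c0t2d0 ... c1t4d0 \
    /var/tmp/missing1 /var/tmp/missing2
# zpool offline bigpool /var/tmp/missing1
# zpool offline bigpool /var/tmp/missing2
(the pool is now DEGRADED but still tolerates one more disk failure)
# zfs snapshot -r pond@migrate
# zfs send -R pond@migrate | zfs recv -vdF bigpool
(when two more bays are freed up, swap the real disks in)
# zpool replace bigpool /var/tmp/missing1 c4t0d0
# zpool replace bigpool /var/tmp/missing2 c4t1d0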
You forgot IDEA #6 where you take advantage of the fact that zfs can be told to use sparse files as partitions. This is rather like your IDEA #3 but does not require that disks be partitioned.

This opens up many possibilities. Whole vdevs can be virtualized to files on (i.e. moved onto) remaining physical vdevs. Then the drives freed up can be replaced with larger drives and used to start a new pool. It might be easier to upgrade the existing drives in the pool first so that there is assured to be vast amounts of free space and the drives get some testing. There is not initially additional risk due to raidz1 in the pool since the drives will be about as full as before.

I am not sure what additional risks are involved due to using files.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
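In concrete terms, the mechanic being described looks roughly like this (a tiny sketch with hypothetical names; the backing file would have to live somewhere outside the pool being reshuffled, e.g. on rpool):

# mkfile -n 250g /rpool/scratch/c4t3d0.img
# zpool replace pond c4t3d0 /rpool/scratch/c4t3d0.img
(once the resilver finishes, c4t3d0's bay is free for a bigger drive)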
Jim Klimov <jimklimov at cos.ru> wrote:

> We know that large redundancy is highly recommended for
> big HDDs, so in-place autoexpansion of the raidz1 pool
> onto 3Tb disks is out of the question.

Before I started to use my thumper, I reconfigured it to use RAID-Z2.

This allows me to just replace disks during operation without losing all redundancy while expanding.

Jörg
--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
2012-05-16 6:18, Bob Friesenhahn wrote:

> You forgot IDEA #6 where you take advantage of the fact that zfs can be
> told to use sparse files as partitions. This is rather like your IDEA #3
> but does not require that disks be partitioned.

This is somewhat the method of making "missing devices" when creating a ZFS pool (i.e. 8+1(+2missing) as in my earlier mail).

> This opens up many possibilities. Whole vdevs can be virtualized to
> files on (i.e. moved onto) remaining physical vdevs.

This is a nifty idea in general, but in practice this pool is kept quite full - about 100Gb free by df / zfs list accounting, although with the zfs reserved space the figure jumps to 740Gb free in zpool list reports (hopefully that's what keeps the system performing quite well despite the full, fragmented pool).

> Then the drives
> freed up can be replaced with larger drives and used to start a new
> pool. It might be easier to upgrade the existing drives in the pool
> first so that there is assured to be vast amounts of free space and the
> drives get some testing. There is not initially additional risk due to
> raidz1 in the pool since the drives will be about as full as before.

Your idea actually evolved for me into another (#7?), which is simple and apparent enough to be ingenious ;)

DO use the partitions, but split the "2.73Tb" drives into a roughly "2.5Tb" partition followed by a "250Gb" partition of the same size as the vdevs of the original old pool. Then the new drives can replace a dozen of the original small disks one by one, in one-to-one resilvering, with no worsening of the situation in regard to downtime or original/new pool integrity tradeoffs (in fact, several untrustworthy old disks will be replaced by newer ones). When the new dozen of disks is in place, the complete 8+3 new pool can be created with no compromises, the old data migrated onto it, and then the old pool can be destroyed after everything has been checked to be properly accessible. The remaining 250Gb disks can be repurposed, while the tailing partitions on the new disks can join the big pool by autoexpansion (i.e. remove the second partitions, expand the first partitions in the label table, autoexpand the pool - I did that a few times on other occasions). A rough command sketch of this plan follows below.

In fact, this scenario seems like the best of all worlds to me now, unless someone talks me out of it with some pretty good reasoning. So thanks for keeping the dialog and thought-flow going :)

> I am not sure what additional risks are involved due to using files.

Well, ZFS docs and blogs pose files as a testing technique more than one intended for production, due to possible issues between ZFS and disks brought in by the filesystem underneath. I believe the same reasoning should apply to other similar methods though, like iSCSI from remote storage, or lofi devices, or SVM as I thought of (ab)using in this migration.

Thanks,
//Jim
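The sketch of that plan, with hypothetical device names and the slice layout described later in the thread (s0 = the big slice, s1 = the 250Gb-ish slice); an outline of the moves only, not a tested script:

# zpool replace pond c4t3d0 c1t2d0s1    # old 250Gb disk -> small slice on a new drive
(wait for the resilver, then repeat for the next old disk / new drive)
# zpool create bigpool raidz3 c1t2d0s0 c2t2d0s0 ... c6t5d0s0    # eleven big slices
(zfs send the data over, destroy the old pool, then delete the s1 slices,
 grow s0 in the label and let the pool pick up the extra space)
# zpool set autoexpand=on bigpool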
2012-05-16 13:30, Joerg Schilling wrote:

> Jim Klimov <jimklimov at cos.ru> wrote:
>
>> We know that large redundancy is highly recommended for
>> big HDDs, so in-place autoexpansion of the raidz1 pool
>> onto 3Tb disks is out of the question.
>
> Before I started to use my thumper, I reconfigured it to use RAID-Z2.
>
> This allows me to just replace disks during operation without losing all
> redundancy while expanding.

Makes sense; however, this choice was not made originally, in favor of getting more usable disk space. Besides, with the recommended 3-disk redundancy, a raidz2 pool would still face the same migration requirement to a new pool with a different disk layout.

But thanks for your comment, nonetheless,
//Jim
On Wed, May 16, 2012 at 1:45 PM, Jim Klimov <jimklimov at cos.ru> wrote:

> Your idea actually evolved for me into another (#7?), which
> is simple and apparent enough to be ingenious ;)
> DO use the partitions, but split the "2.73Tb" drives into a
> roughly "2.5Tb" partition followed by a "250Gb" partition of
> the same size as the vdevs of the original old pool. Then the
> new drives can replace a dozen of the original small disks one
> by one, in one-to-one resilvering, with no worsening
> of the situation in regard to downtime or original/new pool
> integrity tradeoffs (in fact, several untrustworthy old disks
> will be replaced by newer ones).

Err, why go to all that trouble? Replace one disk per pool. Wait for resilver to finish. Replace next disk. Once all/enough disks have been replaced, turn on autoexpand, and you're done.

--
http://www.glumbert.com/media/shift
http://www.youtube.com/watch?v=tGvHNNOLnCk
"This officer's men seem to follow him merely out of idle curiosity." -- Sandhurst officer cadet evaluation.
"Securing an environment of Windows platforms from abuse - external or internal - is akin to trying to install sprinklers in a fireworks factory where smoking on the job is permitted." -- Gene Spafford
learn french: http://www.youtube.com/watch?v=30v_g83VHK4
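In command terms this route is about as simple as it gets (hypothetical device name; the new disk goes into the same bay as the one it replaces):

# zpool replace pond c4t3d0    # tell ZFS the disk in that bay was swapped
# zpool status pond            # wait for the resilver to complete
(repeat for the remaining disks of the raidz set, then)
# zpool set autoexpand=on pond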
Hello fellow BOFH, I also went by that title in a previous life ;)

2012-05-16 21:58, bofh wrote:

> Err, why go to all that trouble? Replace one disk per pool. Wait for
> resilver to finish. Replace next disk. Once all/enough disks have
> been replaced, turn on autoexpand, and you're done.

As I wrote at the start of the thread, the original pool had 45 250Gb disks laid out as raidz1. This level of redundancy is too small for big disks - i.e. the recent resilver of a 250Gb disk, when it did finally succeed, took 15 hours. It seems likely that a 3Tb disk would take at least 12x the time, about a week (likely more), during which a raidz1 set would remain unprotected in the face of another failure.

So an in-place upgrade by autoexpansion would require:
1) keeping the raidz1 layout, which is unsafe;
2) upgrading all 45 disks, which is too much storage for current needs, and a big cost paid for no benefit to the buyer.

So this method was ruled out for this situation.

Thanks,
//Jim
There's something going on then. I have 7x 3TB disks at home, in raidz3, so about 12TB usable, 2.5TB actually used. Scrubbing takes about 2.5 hours. I had done the resilvering as well, and that did not take 15 hours/drive. Copying 3TB onto 2.5" SATA drives did take more than a day, but a 2.5" drive's performance is about 1/4 of the 3.5" drives from the limited testing I've done.

Additionally, if you're only replacing one drive at a time, you're only resilvering 250GB at a time, regardless of the size of the new drive.

If you already have 45x 3TB drives waiting to go in, bite the bullet and get that eSATA cage, since you want to re-do your zpools. You can reuse it for offsite backups in the future.

As a side note, on my x4540, I get writes of up to 1.2 gigabytes/second (but that's just writing zeros to an uncompressed pool). Real performance is lower, of course.

On Wed, May 16, 2012 at 2:08 PM, Jim Klimov <jimklimov at cos.ru> wrote:

> Hello fellow BOFH,
> I also went by that title in a previous life ;)

:)

--
http://www.glumbert.com/media/shift
http://www.youtube.com/watch?v=tGvHNNOLnCk
"This officer's men seem to follow him merely out of idle curiosity." -- Sandhurst officer cadet evaluation.
"Securing an environment of Windows platforms from abuse - external or internal - is akin to trying to install sprinklers in a fireworks factory where smoking on the job is permitted." -- Gene Spafford
learn french: http://www.youtube.com/watch?v=30v_g83VHK4
On Wed, 16 May 2012, Jim Klimov wrote:

> Your idea actually evolved for me into another (#7?), which
> is simple and apparent enough to be ingenious ;)
> DO use the partitions, but split the "2.73Tb" drives into a
> roughly "2.5Tb" partition followed by a "250Gb" partition of
> the same size as the vdevs of the original old pool. Then the
> new drives can replace a dozen of the original small disks one
> by one, in one-to-one resilvering, with no worsening
> of the situation in regard to downtime or original/new pool
> integrity tradeoffs (in fact, several untrustworthy old disks
> will be replaced by newer ones).

I like this idea since it allows running two complete pools on the same disks without using files. Due to using partitions, the disk write cache will be disabled unless you specifically enable it.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
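For reference, a sketch of checking and enabling the write cache by hand from format's expert mode (assuming the sd/sata stack on the Thumper exposes the cache menu for these disks - not verified on an X4500):

# format -e
(select the disk, then)
format> cache
cache> write_cache
write_cache> display
write_cache> enable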
bofh <goodb0fh at gmail.com> wrote:

> There's something going on then. I have 7x 3TB disks at home, in
> raidz3, so about 12TB usable, 2.5TB actually used. Scrubbing takes
> about 2.5 hours. I had done the resilvering as well, and that did not
> take 15 hours/drive. Copying 3TB onto 2.5" SATA drives did take more
> than a day, but a 2.5" drive's performance is about 1/4 of the 3.5"
> drives from the limited testing I've done.

The performance of a thumper depends on whether you set it up correctly. A thumper offers 6 independent SATA controllers that are able to do independent DMA simultaneously. For this reason, I set up each "row" for ZFS with 6 drives: 4 drives for the net capacity and two parity drives. I get a sustained local read performance of 600 MB/s this way.

> Additionally, if you're only replacing one drive at a time, you're
> only resilvering 250GB at a time, regardless of the size of the new
> drive.
>
> If you already have 45x 3TB drives waiting to go in, bite the bullet
> and get that eSATA cage, since you want to re-do your zpools. You can
> reuse it for offsite backups in the future.

This is a misinterpretation. If you have 7 raid-z2 rows with 6 drives each, you may replace up to 7 drives at once. I did not yet test this, but I am sure that this will finish in less than a day, so the upgrade may take approx. a week.

> As a side note, on my x4540, I get writes of up to 1.2
> gigabytes/second (but that's just writing zeros to an uncompressed
> pool). Real performance is lower, of course.

With the original drives delivered by Sun?

Jörg
--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
2012-05-16 22:21, bofh wrote:

> There's something going on then. I have 7x 3TB disks at home, in
> raidz3, so about 12TB usable, 2.5TB actually used. Scrubbing takes
> about 2.5 hours. I had done the resilvering as well, and that did not
> take 15 hours/drive.

That is the critical moment ;) The system we plan to upgrade has run nearly full for a couple of years now, with some overflowing data spilling out to another server or a performant workstation. This system has a lot of writes and rewrites (rolling backups, VM images, home dirs and compilations, document storage and so on), compounded by the zfs-auto-snap services. Because there are many rewrites like this, fragmentation is a substantial hit to scrub and resilver performance: both need to walk the whole block-pointer tree, which yields lots of random reads of small blocks from all over the pool. I began writing a letter with guesses/questions about that, but it seems to be lengthy and may get published in a few days ;)

> Additionally, if you're only replacing one drive at a time, you're
> only resilvering 250GB at a time, regardless of the size of the new
> drive.

That is true.

> If you already have 45x 3TB drives waiting to go in, bite the bullet
> and get that eSATA cage, since you want to re-do your zpools. You can
> reuse it for offsite backups in the future.

That is a good plan, except that the server's owners plan to upgrade a dozen drives for now. Even this would triple the current pool's size while using a quarter of the disk bays. They only have one drive on hand for testing and plan to buy a dozen more; there are no 45 drives waiting. There is no spoon, either ;)

I will try to get them into making a (local/offsite) backup box as well. Since a new server (from an x4540 to a handmade Supermicro) would likely be more performant and have more RAM, I expect that this Thumper would ultimately become the backup box for the new server.

Thanks,
//Jim
2012-05-15 19:17, Casper.Dik at oracle.com wrote:

> Your old release of Solaris (nearly three years old) doesn't support
> disks over 2TB, I would think.
>
> (A 3TB is 3E12, the 2TB limit is 2^41 and the difference is around 800Gb)

While this was proven correct by my initial experiments, it seems that things are even weirder: as I wrote, I did boot the Thumper into oi_151a3 yesterday, and it saw the big disk as 2.73Tb. I made a GPT partition for the whole disk size and booted back into OpenSolaris SXCE snv_117.

I wrote that it still sees the disk as being smaller, and it does in the headers of the fdisk and format programs. The partition is seen as "EFI" by snv_117 fdisk, sized "48725 cylinders of 32130 (512 byte) blocks" each, which computes to 801553536000 bytes.

However, when I drilled down into the partition/slice table today, format complained a bit but saw the whole disk. So I laid it out as 2.5Tb and 250Gb slices and will give them a go as test pools to see if writing to one would corrupt another. If this works, I guess I should DD the GPT table around to the new 3Tb drives in the IDEA 7 setup...

Format's complaints:

1) When opening the disk:

Error: can't open disk '/dev/rdsk/c1t2d0p0'.
No Solaris fdisk partition found.
Error: can't open disk '/dev/rdsk/c1t2d0p0'.
No Solaris fdisk partition found.

2) When labeling the disk:

partition> label
Ready to label disk, continue? y
no reserved partition found

----

Here's my new slice table (no slice 8 indeed - unlike the old disks):

partition> p
Current partition table (unnamed):
Total disk sectors available: 5860516750 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm               256       2.50TB         5372126207
  1        usr    wm        5372126415     232.87GB         5860500366
  2 unassigned    wm                 0           0                    0
  3 unassigned    wm                 0           0                    0
  4 unassigned    wm                 0           0                    0
  5 unassigned    wm                 0           0                    0
  6        usr    wm        5860500367       8.00MB         5860516750

This table did get saved, test pools created with no hiccups:

# zpool create test c1t2d0s0
# zpool create test2 c1t2d0s1
# zpool status
...
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          c1t2d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: test2
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test2       ONLINE       0     0     0
          c1t2d0s1  ONLINE       0     0     0

# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
pond   10.2T  9.49T   724G    93%  ONLINE  -
rpool   232G   120G   112G    51%  ONLINE  -
test   2.50T  76.5K  2.50T     0%  ONLINE  -
test2   232G  76.5K   232G     0%  ONLINE  -

Now writing stuff into the new test pools to see if any conflicts arise in snv_117's support of the disk size.

Thanks,
//Jim
A small follow-up on my tests, just in case readers are interested in some numbers: the UltraStar 3Tb disk got filled up by a semi-random selection of data from our old pool in 24 hours sharp, including large dump files and small source dirs via rsync, and some recursive zfs sends of VM storage including autosnaps, ranging from near-zero size to considerable increments.

Overall the write speed onto the on-disk pools ranged from about 3-6Mb/s for small files to 40-95Mb/s for larger ones (i.e. ISOs and VM disk images). The resulting zpools include a bit of spare space (AFAIK to fight fragmentation), roughly 4Gb per 250Gb of pool size, but no more userdata can be added into datasets:

# zpool list
...
test   2.50T  2.46T  40.0G    98%  ONLINE  -
test2   232G   228G  4.37G    98%  ONLINE  -

# df -k /test /test2
Filesystem            kbytes    used   avail capacity  Mounted on
test                2642411542 2372859619       0   100%    /test
test2                239468544  238689091  778716   100%    /test2

The two filled-up pools are scrubbing now, in search of disk errors as well as the feared/expected errors due to a possible overflow in some LBA-address counter or whatever it was that prevented snv_117 from seeing the full disk size in the first place. Current impressions are that all is OK, knocking on wood. Scrubbing reads at 35-90Mb/s, leaning more toward ~75, with the disk processing over 600 IOps at 100% busy in iostat. Little fragmentation from one write pass with no deletions so far is oh-so-good! ;)

2012-05-17 1:21, Jim Klimov wrote:

> 2012-05-15 19:17, Casper.Dik at oracle.com wrote:
>> Your old release of Solaris (nearly three years old) doesn't support
>> disks over 2TB, I would think.
>>
>> (A 3TB is 3E12, the 2TB limit is 2^41 and the difference is around 800Gb)

> # zpool list
> NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
> pond   10.2T  9.49T   724G    93%  ONLINE  -
> rpool   232G   120G   112G    51%  ONLINE  -
> test   2.50T  76.5K  2.50T     0%  ONLINE  -
> test2   232G  76.5K   232G     0%  ONLINE  -
>
> Now writing stuff into the new test pools to see if any
> conflicts arise in snv_117's support of the disk size.
New question: if snv_117 does see the 3Tb disks well, the matter of upgrading the OS becomes not so urgent - we might prefer to delay that until the next stable release of OpenIndiana or so.

Now that I think of it, when was raidz3 introduced?.. I don't see it in the zpool manpage as of SXCE snv_117, but it is in SXCE snv_129 with zpool v22 on another box :) (and "zpool upgrade -v" there says triple-parity raidz was added in zpool v17). Even so, LiveUpgrading SXCE to its latest release seems like an easier (faster) solution than migrating the whole OS and its local zones into the IPS paradigm right away.

THE QUESTION: Would there be substantial issues if we start out making and filling the new raidz3 8+3 pool in SXCE snv_129 (with zpool v22) or snv_130, and later upgrade the big zpool along with the major OS migration - issues that could be avoided by a preemptive upgrade to oi_151a or later (oi_151a3?)

Perhaps some known pool corruption issues or poor data layouts in older ZFS software releases?..

Thanks,
//Jim
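A couple of commands that might help settle this from the box itself, plus a hedged thought: the new pool could be created pinned at the older on-disk version, so it stays importable by snv_129/130 even after a later OS upgrade, and gets "zpool upgrade"d only once the new OS has proven itself (device names hypothetical):

# zpool upgrade -v | head    # ZFS versions the running kernel supports
# zpool get version pond     # what the existing pool is at
# zpool create -o version=22 bigpool raidz3 c1t2d0s0 c2t2d0s0 ...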
2012-05-18 1:39, Jim Klimov wrote:

> A small follow-up on my tests, just in case readers are
> interested in some numbers: the UltraStar 3Tb disk got
> filled up by a semi-random selection of data from our old
> pool in 24 hours sharp

One more number: the smaller pool completed its scrub in 57 minutes of nice-looking sequential reads of 70Mb/s on average, no errors. The bigger pool had bulkier files from the start, and iostat reports up to 150Mb/s (in up to 1200 IOps) while scrubbing that part of the disk. Wow! :) Since there were many small files as well, I expect the speeds to drop.

Writing the pool to its limits averaged 33Mb/s (2.73Tb/86400s), not too bad either compared (say) to what I saw on my 6-disk raidz2 at home ;)

//Jim
On Fri, 18 May 2012, Jim Klimov wrote:

> Would there be substantial issues if we start out making
> and filling the new raidz3 8+3 pool in SXCE snv_129 (with
> zpool v22) or snv_130, and later upgrade the big zpool
> along with the major OS migration, that can be avoided
> by a preemptive upgrade to oi_151a or later (oi_151a3?)
>
> Perhaps, some known pool corruption issues or poor data
> layouts in older ZFS software releases?..

I can't attest as to potential issues, but the newer software surely fixes many bugs and it is also likely that the data layout improves in newer software. Improved data layout would result in better performance.

It seems safest to upgrade the OS before moving a lot of data. Leave a fallback path in case the OS upgrade does not work as expected.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Thu, May 17, 2012 at 2:50 PM, Jim Klimov <jimklimov at cos.ru> wrote:

> New question: if the snv_117 does see the 3Tb disks well,
> the matter of upgrading the OS becomes not so urgent - we
> might prefer to delay that until the next stable release
> of OpenIndiana or so.

There were some pretty major fixes and new features added between snv_117 and snv_134 (the last OpenSolaris release). It might be worth updating to snv_134 at the very least.

-B

--
Brandon High : bhigh at freaks.com