I've read that it's supposed to go at full speed, i.e. as fast as
possible. I'm doing a disk replace, and what zpool reports kind of
surprises me: the resilver goes on at 1.6MB/s. Did resilvering get
throttled at some point between the builds, or is my ATA controller
having bigger issues?

Thanks,
-mg
Hello Mario,

Wednesday, May 9, 2007, 5:56:18 PM, you wrote:

MG> I've read that it's supposed to go at full speed, i.e. as fast as
MG> possible. I'm doing a disk replace, and what zpool reports kind of
MG> surprises me: the resilver goes on at 1.6MB/s. Did resilvering get
MG> throttled at some point between the builds, or is my ATA controller
MG> having bigger issues?

Lot of small files, perhaps? What kind of protection have you used?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Robert Milkowski wrote:
> Lot of small files, perhaps? What kind of protection have you used?

Good question. Remember that resilvering is done in time order and
from the top-level metadata down, not by sequentially blasting bits.
Jeff Bonwick describes this as top-down resilvering:
http://blogs.sun.com/bonwick/entry/smokin_mirrors

From an MTTR and performance perspective, this means that ZFS recovery
time is a function of the amount of space used, where it is located
(!), and the validity of the surviving or regenerated data. The big
win is the amount of space used, since most file systems are not full.
 -- richard
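To make the top-down order concrete, here is a minimal sketch of such
a traversal, assuming a hypothetical block-pointer type and helper
functions (this is illustrative pseudocode in C, not ZFS source):

    /* Illustrative sketch only -- not actual ZFS code.  The blkptr
     * type, queue, and all helpers are invented for this example. */
    struct blkptr;
    struct queue;                      /* FIFO of block pointers */

    extern struct blkptr *root_blkptr(void);           /* uberblock */
    extern int            nchildren(struct blkptr *bp);
    extern struct blkptr *child(struct blkptr *bp, int i);
    extern void           read_and_repair(struct blkptr *bp);
    extern struct queue  *queue_create(void);
    extern void           enqueue(struct queue *q, struct blkptr *bp);
    extern struct blkptr *dequeue(struct queue *q);
    extern int            queue_empty(struct queue *q);

    /*
     * Resilver by walking the block tree from the uberblock down.
     * Every block is repaired before any block it points to, so the
     * most precious blocks (whose loss would orphan the most data)
     * are safe earliest.  Only allocated blocks are ever visited,
     * which is why resilver time scales with space used and layout,
     * not with raw disk size.
     */
    void resilver_top_down(void)
    {
            struct queue *q = queue_create();
            enqueue(q, root_blkptr());
            while (!queue_empty(q)) {
                    struct blkptr *bp = dequeue(q);
                    read_and_repair(bp);
                    for (int i = 0; i < nchildren(bp); i++)
                            enqueue(q, child(bp, i));
            }
    }

Note the trade-off this implies: the traversal follows pointers, so
pools full of small, scattered blocks turn the resilver into random
I/O rather than a streaming copy.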
Hello Richard,

Wednesday, May 9, 2007, 9:10:22 PM, you wrote:

RE> Good question. Remember that resilvering is done in time order and
RE> from the top-level metadata down, not by sequentially blasting
RE> bits. Jeff Bonwick describes this as top-down resilvering:
RE> http://blogs.sun.com/bonwick/entry/smokin_mirrors
RE>
RE> From an MTTR and performance perspective, this means that ZFS
RE> recovery time is a function of the amount of space used, where it
RE> is located (!), and the validity of the surviving or regenerated
RE> data. The big win is the amount of space used, since most file
RE> systems are not full.

Nevertheless, with a lot of small files written over many months (some
since removed), resilvering in raidz2 is SLOOOW, even when there's no
other activity in the pool (7-10 days on an x4500 with 11 disks in a
raidz2 group). Either that's inherent in such environments or
something else is wrong.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
> Good question. Remember that resilvering is done in time order and
> from the top-level metadata down, not by sequentially blasting bits.
> Jeff Bonwick describes this as top-down resilvering:
> http://blogs.sun.com/bonwick/entry/smokin_mirrors
>
> From an MTTR and performance perspective, this means that ZFS
> recovery time is a function of the amount of space used, where it is
> located (!), and the validity of the surviving or regenerated data.
> The big win is the amount of space used, since most file systems are
> not full.
> -- richard

It seems to me that once you copy the metadata, you can indeed copy
all live data sequentially. Given that the vast majority of disk
blocks in use will typically contain data, this is a winning strategy
from a performance point of view, and it still allows you to retrieve
a fair bit of data in case of a second disk failure (checksumming will
catch the case where good metadata points to an as-yet-uncopied data
block). If the amount of live data is > 50% of the disk space, you may
as well do a straight disk copy, perhaps skipping over already-copied
metadata.

Not only that, you can even start using the disk being resilvered
right away for writes. A new write will go either to a) an
already-copied block or b) an as-yet-uncopied block. In case a) there
is nothing more to do. In case b) the copied-from block will have the
new data, so in both cases the right thing happens. Any potential
window between reading a copied-from block and writing to the
copied-to block can be closed with careful coding/locking. (A rough
sketch of this write-path logic follows below.)

If a second disk fails during copying, the current strategy doesn't
buy you much in almost any case. You really don't want to go through a
zillion files looking for survivors. If you have a backup, you will
restore from that rather than look through the debris. Not to mention
that the window for a potentially catastrophic failure grows much
larger if resilvering is significantly slower.

Comments?
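Here is a minimal sketch of the write-path case analysis above. The
cursor-based scheme and every name in it are hypothetical -- this is
Bakul's thought experiment made concrete, not anything ZFS implements:

    /* Hypothetical sketch -- not ZFS code.  A single cursor tracks
     * how far the sequential copy has progressed.  Writes behind the
     * cursor (case a: already copied) must also go to the new disk;
     * writes ahead of it (case b) will be picked up by the copier
     * when it gets there. */
    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    static uint64_t        resilver_cursor;  /* first uncopied LBA */
    static pthread_mutex_t cursor_lock = PTHREAD_MUTEX_INITIALIZER;

    extern void disk_write(int disk, uint64_t lba,
                           const void *buf, size_t len);
    #define OLD_DISK 0
    #define NEW_DISK 1

    /* An application write arriving mid-resilver. */
    void write_during_resilver(uint64_t lba, const void *buf,
                               size_t len)
    {
            /* Hold the lock across the check and the writes so the
             * copier cannot pass this LBA in between -- the "window"
             * closed by careful locking.  (A real implementation
             * would use finer-grained range locks.) */
            pthread_mutex_lock(&cursor_lock);
            int already_copied = (lba < resilver_cursor);
            disk_write(OLD_DISK, lba, buf, len); /* case b) covered */
            if (already_copied)
                    disk_write(NEW_DISK, lba, buf, len); /* case a) */
            pthread_mutex_unlock(&cursor_lock);
    }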
On 9-May-07, at 3:44 PM, Bakul Shah wrote:

> It seems to me that once you copy the metadata, you can indeed copy
> all live data sequentially.

I don't see this, given the top-down strategy. For instance, if I
understand the transactional update process, you can't commit the
metadata until the data is in place.

Can you explain your reasoning in more detail?

> Not only that, you can even start using the disk being resilvered
> right away for writes. A new write will go either to a) an
> already-copied block

How can that be, under a COW régime?

--Toby
> Lot of small files, perhaps? What kind of protection have you used?

No protection, and about as many small files as a full distro install
has, plus some more source code for a few libs. It's just 28GB that
needs to be resilvered, yet it takes about 6 hours at this abysmal
speed. At first I thought it was intentional, to keep the system
responsive, but then I read something about full speed.

The replacement disk is on the same IDE channel as the one being
replaced, so performance won't be great, but 1.6MB/s is just poor. I
suspect it's a general problem with my disk subsystem, because when I
migrated my data from the NTFS drives, I also got write speeds in the
1.2-1.6MB/s range. I blamed the NTFS driver at first, but that weird
writing behaviour I posted about earlier, plus this poor resilvering
speed, makes me believe something else is wrong.

I could configure ata.conf with the driveX_block_factor parameters as
described in man ata (the block factor defaults to 1), but doing so
puts me in a boot loop; there isn't even an ata driver instance, just
pci-ide, and putting those parameters in its config doesn't do
anything.

Thanks for any help.
-mg
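For concreteness, an ata.conf entry of the kind described would look
roughly like the following. This is a sketch only: the property names
follow the ata(7D) man page, but the exact path and safe block-factor
values depend on the platform and controller, which is exactly where
this can go wrong (as the boot loop above shows):

    # /platform/i86pc/kernel/drv/ata.conf (sketch, per ata(7D))
    # Block factor controls multi-sector transfers per interrupt;
    # values are illustrative, and an unsupported value can leave
    # the system unbootable.
    drive0_block_factor=0x10;
    drive1_block_factor=0x10;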
Oh god, I found it. So freakin' bizarre. I'm now pushing 27MB/s on
average instead of the meager 1.6MB/s. That's more like it.

This is what happened: back in the day when I bought my first SATA
drive, incidentally a WD Raptor, I wanted Windows to boot off it,
including bootloader placement and everything. That forced me to
disable the IDE drives in the BIOS, because otherwise the MBR,
bootloader and whatever else got placed on the first IDE drive. This
setup works just fine in Windows. In Solaris, however, it doesn't. It
appears that if the drives are not enabled in the BIOS, things like
IDE block mode and IDE prefetching aren't activated for them. Windows'
ATA driver seems to enable these on its own; Solaris' ATA driver
doesn't.

Should I file a bug about this? Not that it's common for IDE drives to
be intentionally disabled, but this issue almost made me tear my hair
out.

-mg
> > It seems to me that once you copy the metadata, you can indeed
> > copy all live data sequentially.
>
> I don't see this, given the top-down strategy. For instance, if I
> understand the transactional update process, you can't commit the
> metadata until the data is in place.
>
> Can you explain your reasoning in more detail?

First, realize that this is just a thought experiment -- I haven't
read much source code in any detail as yet, and it is entirely likely
that what I am suggesting cannot work, or that there is no need for
it! With that caveat....

http://blogs.sun.com/bonwick/entry/smokin_mirrors talks about top-down
resilvering: copy the root blocks first (for example the uberblock),
then the blocks they point to, and so on [1]. A major goal here is to
minimize data loss in case of a second failure. Completely losing a
metadata block means you can't access anything it points to, so
metadata blocks are far more precious than data blocks. This is
different from a normal update transaction, where the copy-on-write
proceeds bottom up -- which is what you are talking about. A major
goal for a normal update is to ensure that a consistent filesystem
structure is seen by all at all times.

All I was suggesting is that once all the metadata is copied (or
"resilvered"), you switch to sequential copying to maximize
performance. This does make checking the validity of a data block more
complicated. So instead of copying the data of file1 and then file2
and so on, just copy blocks in the most efficient order, save their
checksums, and periodically validate a whole bunch (sketched below).
In fact, since the metadata is read first, you can roughly figure out
which metadata blocks will be needed when, to check data block
validity (because you know where the data blocks are stored).

> > Not only that, you can even start using the disk being resilvered
> > right away for writes. A new write will go either to a) an
> > already-copied block
>
> How can that be, under a COW régime?

I was talking about resilvering, not a normal update. Copy-on-write
happens only for a normal update. I was speculating that you can do
normal updates during resilvering. Not sure if this is clear to
anyone!

-- bakul

[1] Top-down resilvering seems very much like a copying garbage
collector. That similarity makes me wonder if the physical layout
could be rearranged in some way for more efficient access to data --
the idea is to resilver and compactify at the same time on one of the
mirrors, then make it the master and resilver the other mirrors.
Nah... probably not worth the hassle. [Again, I suspect no one else
understands what I am talking about :-)]
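A minimal sketch of that second phase -- sequential copy with deferred
batch validation. Everything here is hypothetical (the helpers, the
checksum lookup from the already-resilvered metadata, the space-map
query); it is Bakul's proposal, not an actual ZFS mechanism:

    /* Hypothetical sketch of a sequential-copy resilver with
     * deferred validation -- not ZFS code.  All helpers invented. */
    #include <stdint.h>

    #define BATCH    1024
    #define BLKSIZE  512

    struct pending { uint64_t lba; uint64_t cksum; };

    extern int      read_block(int disk, uint64_t lba, void *buf);
    extern void     write_block(int disk, uint64_t lba,
                                const void *buf);
    extern uint64_t checksum(const void *buf);
    extern uint64_t expected_cksum(uint64_t lba); /* from metadata */
    extern int      block_in_use(uint64_t lba);   /* from space maps */
    extern void     repair_from_redundancy(uint64_t lba);

    void sequential_resilver(int old_disk, int new_disk,
                             uint64_t nblocks)
    {
            struct pending batch[BATCH];
            char buf[BLKSIZE];
            int n = 0;

            for (uint64_t lba = 0; lba < nblocks; lba++) {
                    if (!block_in_use(lba))
                            continue;       /* copy live blocks only */
                    read_block(old_disk, lba, buf);
                    write_block(new_disk, lba, buf);
                    batch[n].lba = lba;
                    batch[n].cksum = checksum(buf);
                    if (++n == BATCH) {
                            /* Validate a whole bunch at once so the
                             * copy itself stays purely sequential. */
                            for (int i = 0; i < n; i++)
                                    if (batch[i].cksum !=
                                        expected_cksum(batch[i].lba))
                                            repair_from_redundancy(
                                                batch[i].lba);
                            n = 0;
                    }
            }
            for (int i = 0; i < n; i++)     /* leftover partial batch */
                    if (batch[i].cksum != expected_cksum(batch[i].lba))
                            repair_from_redundancy(batch[i].lba);
    }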
On 10 May, 2007 - Bakul Shah sent me these 3,2K bytes:

> [1] Top-down resilvering seems very much like a copying garbage
> collector. That similarity makes me wonder if the physical layout
> could be rearranged in some way for more efficient access to data --
> the idea is to resilver and compactify at the same time on one of
> the mirrors, then make it the master and resilver the other mirrors.
> Nah... probably not worth the hassle. [Again, I suspect no one else
> understands what I am talking about :-)]

Doing a mirror rebuild and a "defrag" at the same time sounds
dangerous.. How about a separate "defrag" tool that you can run
whenever you feel like it instead?

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se