A while back we had a Sun engineer come to our office and talk about the benefits of ZFS. I asked him, "Can the uberblock become corrupt, and what happens if it does?" He did not have the answer but swore that he would get it to me. I still haven't gotten that answer and was wondering if someone here could enlighten me.

This message posted from opensolaris.org

IANA ZFS guru, but I have read explanations like this:

When ZFS reads in the uberblock, it computes the uberblock's checksum and compares it against the stored checksum for that block. If they don't match, it uses another copy of the uberblock.

Ross Hosman wrote:
> A while back we had a Sun engineer come to our office and talk about the
> benefits of ZFS. I asked him the question "Can the uberblock become corrupt
> and what happens if it does?", to which he did not have the answer but swore
> to me that he would get it to me. I still haven't gotten that answer and was
> wondering if someone here could enlighten me?

--
Jeff VICTOR, Sun Microsystems, jeff.victor @ sun.com
OS Ambassador, Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq

> A while back we had a Sun engineer come to our office and talk about
> the benefits of ZFS. I asked him the question "Can the uberblock
> become corrupt and what happens if it does?", to which he did not
> have the answer but swore to me that he would get it to me. I still
> haven't gotten that answer and was wondering if someone here could
> enlighten me?

Any data can become corrupt through a variety of processes.

To reduce the chance of it affecting the integrity of the filesystem, there are multiple copies of the UB written, each with a checksum and a generation number. When starting up a pool, the oldest generation copy that checks properly will be used. If the import can't find any valid UB, then it's not going to have access to any data. Think of a UFS filesystem where all copies of the superblock are corrupt.

So 'a' UB can become corrupt, but it is unlikely that 'all' UBs will become corrupt through something that doesn't also make all the data corrupt or inaccessible.

--
Darren Dunham, ddunham at taos.com
Senior Technical Consultant, TAOS, http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >

Hello Darren,

Tuesday, December 12, 2006, 2:10:30 AM, you wrote:

DD> To reduce the chance of it affecting the integrity of the filesystem,
DD> there are multiple copies of the UB written, each with a checksum and a
DD> generation number. When starting up a pool, the oldest generation copy
DD> that checks properly will be used. If the import can't find any valid
DD> UB, then it's not going to have access to any data. Think of a UFS
DD> filesystem where all copies of the superblock are corrupt.

Actually the latest UB, not the oldest.

--
Best regards,
Robert, mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com

> DD> [...] When starting up a pool, the oldest generation copy
> DD> that checks properly will be used.
>
> Actually the latest UB, not the oldest.

My *other* oldest... yeah.

--
Darren Dunham

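The selection rule as corrected above (verify each copy's checksum, then take the newest generation among the copies that pass) can be sketched roughly as follows. This is only an illustration of the idea described in the thread: the `UberblockCopy` type, the helper names, and the use of SHA-256 are assumptions made for the sketch, not ZFS's actual structures or code.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class UberblockCopy:
    """Hypothetical stand-in for one on-disk uberblock copy."""
    txg: int          # generation number of this copy
    payload: bytes    # stands in for the rest of the uberblock contents
    checksum: bytes   # checksum stored alongside the copy


def compute_checksum(txg: int, payload: bytes) -> bytes:
    # SHA-256 is an assumption for this sketch only.
    return hashlib.sha256(txg.to_bytes(8, "little") + payload).digest()


def make_copy(txg: int, payload: bytes) -> UberblockCopy:
    """Build a copy whose stored checksum matches its contents."""
    return UberblockCopy(txg, payload, compute_checksum(txg, payload))


def select_uberblock(copies: Iterable[UberblockCopy]) -> Optional[UberblockCopy]:
    """Return the newest copy whose checksum verifies, or None if none does."""
    valid = [c for c in copies
             if compute_checksum(c.txg, c.payload) == c.checksum]
    # Highest generation number wins; None models "no valid UB found",
    # the case where the import cannot reach any data.
    return max(valid, key=lambda c: c.txg, default=None)
```

If the newest copy fails verification, the function falls back to the next-newest valid one, which is exactly the situation the follow-up questions in this thread are about.
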
> So 'a' UB can become corrupt, but it is unlikely that 'all' UBs will
> become corrupt through something that doesn't also make all the data
> also corrupt or inaccessible.

So how does this work for data which is freed and overwritten; does the system make sure that none of the data referenced by any of the old ueberblocks is ever overwritten?

Casper

Hello Casper,

Tuesday, December 12, 2006, 10:54:27 AM, you wrote:

CDSC> So how does this work for data which is freed and overwritten; does
CDSC> the system make sure that none of the data referenced by any of the
CDSC> old ueberblocks is ever overwritten?

Why should it? If blocks are not in use according to the current UB, I guess you can safely assume they are free.

--
Best regards,
Robert, mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com

> CDSC> So how does this work for data which is freed and overwritten; does
> CDSC> the system make sure that none of the data referenced by any of the
> CDSC> old ueberblocks is ever overwritten?
>
> Why should it? If blocks are not in use according to the current UB,
> I guess you can safely assume they are free.

What if a newer UB is corrupted and you fall back to an older one?

Casper

Also note that the UB is written to every vdev (4 per disk), so the chances of all UBs being corrupted are rather low.

Thanks,
George

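A rough sketch of where four copies per disk could sit. The thread only says "4 per disk"; the 256 KiB label size and the two-at-the-front, two-at-the-back placement used here are assumptions about the on-disk layout, and `label_offsets` is a hypothetical helper rather than a ZFS function.

```python
from typing import List

LABEL_SIZE = 256 * 1024   # assumed size of one label region, in bytes
LABELS_PER_DEVICE = 4     # "4 per disk", as stated in the thread


def label_offsets(device_size: int) -> List[int]:
    """Return the byte offsets of the four label regions on one device.

    Assumes two labels at the start and two at the end of the device,
    with the usable size rounded down to a multiple of LABEL_SIZE.
    """
    if device_size < LABELS_PER_DEVICE * LABEL_SIZE:
        raise ValueError("device too small to hold four labels")
    usable = device_size - device_size % LABEL_SIZE
    front = [0, LABEL_SIZE]
    back = [usable - 2 * LABEL_SIZE, usable - LABEL_SIZE]
    return front + back
```

Spreading the copies across both ends of the device means a localized failure, such as an overwrite of the first few megabytes, is unlikely to take out all four at once.
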
Casper.Dik at Sun.COM wrote:
>> Why should it? If blocks are not in use according to the current UB,
>> I guess you can safely assume they are free.
>
> What if a newer UB is corrupted and you fall back to an older one?
>
> Casper

A block freed in transaction group N cannot be reused until transaction group N+3, so there is no possibility of referencing an overwritten block unless you have to back off more than two uberblocks. At that point, blocks that have been overwritten will show up as corrupted (bad checksums).

-Mark

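The deferred-reuse rule Mark describes can be illustrated with a toy allocator. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and only the "freed in N, reusable in N+3" delay comes from the thread; nothing here reflects ZFS's real allocator.

```python
from collections import defaultdict

REUSE_DELAY = 3  # from the thread: freed in txg N, reusable in txg N+3


class DeferredFreeAllocator:
    """Toy block allocator that delays reuse of freed blocks."""

    def __init__(self, nblocks: int) -> None:
        self.free = set(range(nblocks))
        # Maps "txg in which the blocks become reusable" to those blocks.
        self.pending = defaultdict(set)
        self.txg = 0

    def advance(self, txg: int) -> None:
        """Move to a new transaction group and release matured frees."""
        self.txg = txg
        for ready in [t for t in self.pending if t <= txg]:
            self.free |= self.pending.pop(ready)

    def alloc(self) -> int:
        """Allocate the lowest-numbered free block."""
        if not self.free:
            raise RuntimeError("no free blocks")
        block = min(self.free)
        self.free.remove(block)
        return block

    def release(self, block: int) -> None:
        """Free a block; it stays unavailable for REUSE_DELAY txgs."""
        self.pending[self.txg + REUSE_DELAY].add(block)
```

While a freed block sits in `pending`, its old contents remain intact on disk, so a recent older uberblock that still points at it can be followed safely.
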
On 12-Dec-06, at 9:46 AM, George Wilson wrote:
> Also note that the UB is written to every vdev (4 per disk) so the
> chances of all UBs being corrupted is rather low.

Furthermore, the time window where UBs are mutually inconsistent would be very short, since they'd be updated together?

--Toby

> [...] there is no possibility of referencing an overwritten
> block unless you have to back off more than two uberblocks. At this
> point, blocks that have been overwritten will show up as corrupted (bad
> checksums).

Hmmm. Is there some way we can warn the user to scrub their pool because we had trouble reading an überblock? (Maybe some FMA rules about what to do if an überblock read fails?)

Hello Toby,

Tuesday, December 12, 2006, 4:18:54 PM, you wrote:

TT> On 12-Dec-06, at 9:46 AM, George Wilson wrote:
>> Also note that the UB is written to every vdev (4 per disk) so the
>> chances of all UBs being corrupted is rather low.

It depends, actually: if all your vdevs are on the same array with write-back cache turned on, you can end up with all UBs corrupted, at least in theory.

--
Best regards,
Robert, mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com

> It depends actually - if all your vdevs are on the same array with
> write back cache set to on you actually can end-up with all UB
> corrupted - at least in theory.

Do such caches respond to explicit flushes? My understanding is that ZFS should try to flush between writing the front 2 and the back 2 copies. Not that even that would guarantee anything if there are real bugs in the cache code, but it would improve the odds.

--
Darren Dunham

> Also note that the UB is written to every vdev (4 per disk) so the
> chances of all UBs being corrupted is rather low.

The chances that they're corrupted by the storage system, yes.

However, they are all sourced from the same in-memory buffer, so an undetected in-memory error (e.g. kernel bug) will be replicated to all vdevs.

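A small sketch of why on-disk redundancy does not help in that case. It assumes the checksum is computed over the buffer after the buffer has already been damaged in memory; the function names and the use of SHA-256 are hypothetical choices for the illustration.

```python
import hashlib
from typing import List, Tuple

Copy = Tuple[bytes, bytes]  # (data as written, checksum stored with it)


def write_copies(buffer: bytes, ncopies: int) -> List[Copy]:
    """Write the same in-memory buffer, with its checksum, ncopies times."""
    digest = hashlib.sha256(buffer).digest()
    return [(buffer, digest) for _ in range(ncopies)]


def verifies(copy: Copy) -> bool:
    """Check a stored copy against the checksum stored with it."""
    data, digest = copy
    return hashlib.sha256(data).digest() == digest


def corrupt_in_memory(buffer: bytes) -> bytes:
    """Flip one bit of the first byte, modelling a stray kernel write."""
    return bytes([buffer[0] ^ 0x01]) + buffer[1:]
```

Every copy written from the damaged buffer verifies cleanly, because the checksum faithfully describes the wrong data; the checksum can only catch corruption that happens after it was computed.
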
> However, they are all sourced from the same in-memory buffer, so an
> undetected in-memory error (e.g. kernel bug) will be replicated to all
> vdevs.

Does a scrub attempt to read/verify UBs from the disk? Does it only read the current generation?

--
Darren Dunham

Anton B. Rang wrote:
> However, they are all sourced from the same in-memory buffer, so
> an undetected in-memory error (e.g. kernel bug) will be replicated
> to all vdevs.

I view undetected in-memory errors from a hardware perspective, not as a software bug. Clearly, software bugs can exist, but we presume testing will find these. I'll go out on a limb and predict that this particular code is regularly exercised :-)

For the hardware, there are some gotchas. Most of the low-end (e.g. buy at Fry's) systems don't have ECC memory. I would be very wary of using non-ECC memory in a server where data integrity is important. Spend the extra money, and you will be much happier.

-- richard

> I view undetected in-memory errors from a hardware perspective,
> not as a software bug. Clearly, software bugs can exist, but
> we presume testing will find these.

Sure. My point is simply that, given that we have a monolithic kernel, any bug in kernel or driver code can corrupt any memory in the kernel. I once saw a UFS superblock with an embedded TCP packet.

Hardware errors are possible as well (particularly without parity or ECC) but IMHO less likely. But ECC memory is a very, very good idea.

Anton