Michelle Sullivan
http://www.mhix.org/
Sent from my iPad

> On 30 Apr 2019, at 18:44, rainer@ultra-secure.de wrote:
>
> On 2019-04-30 at 10:09, Michelle Sullivan wrote:
>
>> Now, yes most production environments have multiple backing stores so
>> will have a server or ten to switch to whilst the store is being
>> recovered, but it still wouldn't be a pleasant experience... not to
>> mention the possibility that if one store is corrupted there is a
>> chance that the other store(s) would also be affected in the same way
>> if in the same DC... (e.g. a DC fire - which I have seen) .. and if you
>> have multi-DC stores to protect from that, the size of the pipes
>> between DCs clearly comes into play.
>
> I have one customer with about 13T of ZFS - and because it would take a
> while to restore (actual backups), it zfs-sends delta-snapshots every
> hour to a standby system.
>
> It was handy when we had to rebuild the system with different HBAs.

I wonder what would happen if you scaled that up by just 10x (storage) and
had the master blow up where it needs to be restored from backup.. how
long would one be praying to higher powers that there is no problem with
the backup...? (As in no outage or error causing a complete outage.)...
Don't get me wrong.. we all get to that position at some time, but in my
recent experience 2 issues colliding at the same time results in disaster.
13T is really not something I have issues with, as I can usually cobble
something together with 16T.. (at least since 6T drives became a viable
(cost and availability at short notice) option... even 10T is becoming
easier to get hold of now..) but I have a measly 96T here and it takes
weeks even with gigabit bonded interfaces when I need to restore.
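
(A rough back-of-envelope to put numbers on that "weeks" figure -- this is
only a sketch, and the 200 MB/s assumes two bonded gigabit links running
flat out, which a real restore rarely sustains:

    # ~96 TB over 2 x 1 Gbit/s, i.e. roughly 200 MB/s of usable throughput
    echo "scale=1; 96 * 10^12 / (200 * 10^6) / 86400" | bc
    # => ~5.5 days of pure wire time; add protocol overhead, verification,
    #    and the write speed of the pool being rebuilt, and "weeks" is
    #    entirely plausible.)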
On 2019-04-30 at 11:05, Michelle Sullivan wrote:

> I wonder what would happen if you scaled that up by just 10x (storage)
> and had the master blow up where it needs to be restored from backup..
> how long would one be praying to higher powers that there is no
> problem with the backup...?

Well, the backup itself takes over a day, AFAIK. I'm not sure why it is
that slow. Maybe the old SAS6 HBAs. It's also lots of files.

> (As in no outage or error causing a complete outage.)... don't get me
> wrong.. we all get to that position at some time, but in my recent
> experience 2 issues colliding at the same time results in disaster.
> 13T is really not something I have issues with, as I can usually cobble
> something together with 16T.. (at least since 6T drives became a viable
> (cost and availability at short notice) option... even 10T is becoming
> easier to get hold of now..) but I have a measly 96T here and it takes
> weeks even with gigabit bonded interfaces when I need to restore.

Those are all SAS drives, actually. 600 and 1200 GB.
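
(For anyone wanting to set up that kind of hourly delta-send themselves,
it boils down to something like the sketch below. The dataset name, the
"standby" host and the snapshot naming scheme are illustrative only, not a
description of rainer's actual configuration:

    #!/bin/sh
    # Hourly incremental replication sketch, run from cron on the master.
    # "tank/data" and "standby" are placeholders.
    prev=$(zfs list -H -t snapshot -o name -s creation -d 1 tank/data | tail -1)
    now="tank/data@$(date +%Y%m%d%H)"
    zfs snapshot "$now"
    # Send only the blocks changed since the previous snapshot; -F on the
    # receive side rolls the standby back to the last common snapshot if
    # anything touched it in the meantime.
    zfs send -i "$prev" "$now" | ssh standby zfs receive -F tank/data

The first run needs a full send instead of an incremental one.)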
On Tue, Apr 30, 2019 at 5:08 PM Michelle Sullivan <michelle@sorbs.net> wrote:

> but in my recent experience 2 issues colliding at the same time results
> in disaster

Do we know exactly what kind of corruption happened to your pool?  If you
see it twice in a row, it might suggest a software bug that should be
investigated.

Note that ZFS stores multiple copies of its essential metadata, and in my
experience with my old, consumer-grade, crappy hardware (non-ECC RAM, and
several single-drive pools on faulty disks: bad enough to crash almost
monthly and damage my data from time to time), I've never seen corruption
this bad and I was always able to recover the pool.  At a previous
employer, the only case where a pool was corrupted badly enough that
mounting was not allowed was when two host nodes happened to import the
pool at the same time, which is a situation that can be avoided with SCSI
reservations; their hardware was of much better quality, though.

Speaking of a tool like 'fsck': I think I'm mostly convinced that it's
*not* necessary, because by the time ZFS says the metadata is corrupted,
that metadata really is corrupted beyond repair (all replicas were
corrupted; otherwise it would recover by finding the good copy and
rewriting the bad ones).  An interactive tool may be useful (e.g. "I see
data structure versions 1, 2 and 3, all with bad checksums; choose which
one you want to try"), but I don't think it would be very practical for
large pools -- unlike traditional filesystems, ZFS uses copy-on-write and
depends heavily on its metadata to find where the data is, so a regular
"scan" is not really useful.

I'd agree that you need a full backup anyway, regardless of what storage
system is used, though.
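
(Worth noting that a limited form of that "pick an older copy of the data
structures" idea already exists as the rewind options of zpool import; how
far back it can rewind is limited, so it is no substitute for a backup. A
sketch, with "storage" standing in for the pool name:

    # Option 1: read-only import, to see what is visible without writing
    # anything to the pool:
    zpool import -o readonly=on -f storage
    # Option 2: recovery-mode import, which discards the last few
    # transactions and rolls back to an earlier, hopefully consistent,
    # txg.  With -n it only reports whether the rewind would succeed:
    zpool import -F -n storage
    zpool import -F storage

Whether either helps depends entirely on how far back the damage goes.)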
On Apr 30, 2019, at 5:05 AM, Michelle Sullivan <michelle@sorbs.net> wrote:

> Michelle Sullivan
> http://www.mhix.org/
> Sent from my iPad
>
>> On 30 Apr 2019, at 18:44, rainer@ultra-secure.de wrote:
>>
>> On 2019-04-30 at 10:09, Michelle Sullivan wrote:
>>
>>> Now, yes most production environments have multiple backing stores so
>>> will have a server or ten to switch to whilst the store is being
>>> recovered, but it still wouldn't be a pleasant experience... not to
>>> mention the possibility that if one store is corrupted there is a
>>> chance that the other store(s) would also be affected in the same way
>>> if in the same DC... (e.g. a DC fire - which I have seen) .. and if you
>>> have multi-DC stores to protect from that, the size of the pipes
>>> between DCs clearly comes into play.
>>
>> I have one customer with about 13T of ZFS - and because it would take a
>> while to restore (actual backups), it zfs-sends delta-snapshots every
>> hour to a standby system.
>>
>> It was handy when we had to rebuild the system with different HBAs.
>
> I wonder what would happen if you scaled that up by just 10x (storage)
> and had the master blow up where it needs to be restored from backup..
> how long would one be praying to higher powers that there is no problem
> with the backup...? (As in no outage or error causing a complete
> outage.)... Don't get me wrong.. we all get to that position at some
> time, but in my recent experience 2 issues colliding at the same time
> results in disaster. 13T is really not something I have issues with, as
> I can usually cobble something together with 16T.. (at least since 6T
> drives became a viable (cost and availability at short notice) option...
> even 10T is becoming easier to get hold of now..) but I have a measly
> 96T here and it takes weeks even with gigabit bonded interfaces when I
> need to restore.

Such is the curse of large-scale storage when disaster befalls it. I guess
you need to invent a home-brew version of Amazon Snowball or Amazon
Snowmobile. ;-)

Cheers,

Paul.
Xin LI wrote:
> On Tue, Apr 30, 2019 at 5:08 PM Michelle Sullivan <michelle@sorbs.net
> <mailto:michelle@sorbs.net>> wrote:
>
>     but in my recent experience 2 issues colliding at the same time
>     results in disaster
>
> Do we know exactly what kind of corruption happened to your pool?  If
> you see it twice in a row, it might suggest a software bug that should
> be investigated.

Oh, I did spot one interesting bug... though it is benign... Check out the
following (note the difference between 'zpool status' and 'zpool status -v'):

root@colossus:/mnt # zpool status
  pool: storage
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 29 20:22:03 2019
        6.54T scanned at 0/s, 6.54T issued at 0/s, 28.8T total
        445G resilvered, 22.66% done, no estimated completion time
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     5
          raidz2-0  ONLINE       0     0    20
            mfid11  ONLINE       0     0     0
            mfid10  ONLINE       0     0     0
            mfid8   ONLINE       0     0     0
            mfid7   ONLINE       0     0     0
            mfid0   ONLINE       0     0     0
            mfid5   ONLINE       0     0     0
            mfid4   ONLINE       0     0     0
            mfid3   ONLINE       0     0     0
            mfid2   ONLINE       0     0     0
            mfid14  ONLINE       0     0     0
            mfid15  ONLINE       0     0     0
            mfid6   ONLINE       0     0     0
            mfid9   ONLINE       0     0     0
            mfid13  ONLINE       0     0     0
            mfid1   ONLINE       0     0     0

errors: 4 data errors, use '-v' for a list

root@colossus:/mnt # zpool status
  pool: storage
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 29 20:22:03 2019
        6.54T scanned at 0/s, 6.54T issued at 0/s, 28.8T total
        445G resilvered, 22.66% done, no estimated completion time
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     5
          raidz2-0  ONLINE       0     0    20
            mfid11  ONLINE       0     0     0
            mfid10  ONLINE       0     0     0
            mfid8   ONLINE       0     0     0
            mfid7   ONLINE       0     0     0
            mfid0   ONLINE       0     0     0
            mfid5   ONLINE       0     0     0
            mfid4   ONLINE       0     0     0
            mfid3   ONLINE       0     0     0
            mfid2   ONLINE       0     0     0
            mfid14  ONLINE       0     0     0
            mfid15  ONLINE       0     0     0
            mfid6   ONLINE       0     0     0
            mfid9   ONLINE       0     0     0
            mfid13  ONLINE       0     0     0
            mfid1   ONLINE       0     0     0

errors: 4 data errors, use '-v' for a list

root@colossus:/mnt # zpool status -v
  pool: storage
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 29 20:22:03 2019
        6.54T scanned at 0/s, 6.54T issued at 0/s, 28.8T total
        445G resilvered, 22.66% done, no estimated completion time
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     5
          raidz2-0  ONLINE       0     0    20
            mfid11  ONLINE       0     0     0
            mfid10  ONLINE       0     0     0
            mfid8   ONLINE       0     0     0
            mfid7   ONLINE       0     0     0
            mfid0   ONLINE       0     0     0
            mfid5   ONLINE       0     0     0
            mfid4   ONLINE       0     0     0
            mfid3   ONLINE       0     0     0
            mfid2   ONLINE       0     0     0
            mfid14  ONLINE       0     0     0
            mfid15  ONLINE       0     0     0
            mfid6   ONLINE       0     0     0
            mfid9   ONLINE       0     0     0
            mfid13  ONLINE       0     0     0
            mfid1   ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x3e>
        <metadata>:<0x5d>
        storage:<0x0>
        storage@now:<0x0>

root@colossus:/mnt # zpool status -v
  pool: storage
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 29 20:22:03 2019
        6.54T scanned at 0/s, 6.54T issued at 0/s, 28.8T total
        445G resilvered, 22.66% done, no estimated completion time
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     7
          raidz2-0  ONLINE       0     0    28
            mfid11  ONLINE       0     0     0
            mfid10  ONLINE       0     0     0
            mfid8   ONLINE       0     0     0
            mfid7   ONLINE       0     0     0
            mfid0   ONLINE       0     0     0
            mfid5   ONLINE       0     0     0
            mfid4   ONLINE       0     0     0
            mfid3   ONLINE       0     0     0
            mfid2   ONLINE       0     0     0
            mfid14  ONLINE       0     0     0
            mfid15  ONLINE       0     0     0
            mfid6   ONLINE       0     0     0
            mfid9   ONLINE       0     0     0
            mfid13  ONLINE       0     0     0
            mfid1   ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x3e>
        <metadata>:<0x5d>
        storage:<0x0>
        storage@now:<0x0>

root@colossus:/mnt # zpool status -v
  pool: storage
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 29 20:22:03 2019
        6.54T scanned at 0/s, 6.54T issued at 0/s, 28.8T total
        445G resilvered, 22.66% done, no estimated completion time
config:

        NAME        STATE     READ WRITE CKSUM
            mfid10  ONLINE       0     0     0
            mfid8   ONLINE       0     0     0
            mfid7   ONLINE       0     0     0
            mfid0   ONLINE       0     0     0
            mfid5   ONLINE       0     0     0
            mfid4   ONLINE       0     0     0
            mfid3   ONLINE       0     0     0
            mfid2   ONLINE       0     0     0
            mfid14  ONLINE       0     0     0
            mfid15  ONLINE       0     0     0
            mfid6   ONLINE       0     0     0
            mfid9   ONLINE       0     0     0
            mfid13  ONLINE       0     0     0
            mfid1   ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x3e>
        <metadata>:<0x5d>
        storage:<0x0>
        storage@now:<0x0>

root@colossus:/mnt # zpool status -v
  pool: storage
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 29 20:22:03 2019
        6.54T scanned at 0/s, 6.54T issued at 0/s, 28.8T total
        445G resilvered, 22.66% done, no estimated completion time
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0    11
          raidz2-0  ONLINE       0     0    44
            mfid11  ONLINE       0     0     0
            mfid10  ONLINE       0     0     0
            mfid8   ONLINE       0     0     0
            mfid7   ONLINE       0     0     0
            mfid0   ONLINE       0     0     0
            mfid5   ONLINE       0     0     0
            mfid4   ONLINE       0     0     0
            mfid3   ONLINE       0     0     0
            mfid2   ONLINE       0     0     0
            mfid14  ONLINE       0     0     0
            mfid15  ONLINE       0     0     0
            mfid6   ONLINE       0     0     0
            mfid9   ONLINE       0     0     0
            mfid13  ONLINE       0     0     0
            mfid1   ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x3e>
        <metadata>:<0x5d>
        storage:<0x0>
        storage@now:<0x0>

root@colossus:/mnt #

--
Michelle Sullivan
http://www.mhix.org/
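
(For anyone who wants to poke at those <metadata> entries: they are MOS
object numbers, 0x3e = 62 and 0x5d = 93 in decimal, and zdb can dump
individual objects by number. This is only a sketch, and zdb run against a
pool that is mid-resilver and accumulating checksum errors may not get
very far:

    # Dump MOS objects 62 (0x3e) and 93 (0x5d) from the pool "storage".
    zdb -dddd storage 62 93
)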