This may be of interest for the failure cause as well as how to recover...

I have a known good 2TB (4kByte physical sectors) HDD that supports sata3 (6Gbit/s). Writing data via rsync at the 6Gbit/s sata rate caused IO errors for just THREE sectors...

Yet btrfsck bombs out with LOTs of errors...

How best to recover from this?

(This is a 'backup' disk so not 'critical' but it would be nice to avoid rewriting about 1.5TB of data over the network...)

Is there an obvious sequence/recipe to follow for recovery?

Thanks,
Martin

Further details:

Linux 3.10.7-gentoo-r1 #2 SMP Fri Sep 27 23:38:06 BST 2013 x86_64 AMD E-450 APU with Radeon(tm) HD Graphics AuthenticAMD GNU/Linux

# btrfs version
Btrfs v0.20-rc1-358-g194aa4a

Single 2TB HDD using default mkfs.btrfs. Entire disk (/dev/sdc) is btrfs (no partitions).

The IO errors were:

kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248

Lots of sata error noise omitted.

The sata problem was fixed by limiting libata to 3Gbit/s:

libata.force=3.0G

added onto the Grub kernel line.

Running "badblocks" twice in succession (non-destructive data test!) shows no surface errors and no further errors on the sata interface.
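The "just THREE sectors" reading of the kernel messages can be checked mechanically. The following is a minimal sketch in Python, not part of the original report: the helper name `failing_sectors` and the shortened `log` excerpt are illustrative assumptions. It relies on the kernel block layer reporting sector numbers in 512-byte units, which lets the sectors be turned into byte offsets on /dev/sdc.

```python
import re
from collections import Counter

SECTOR_SIZE = 512  # kernel block-layer sector numbers are in 512-byte units


def failing_sectors(log_text):
    """Count how often each sector appears in 'end_request: I/O error' lines."""
    pattern = re.compile(r"I/O error, dev (\w+), sector (\d+)")
    return Counter(int(m.group(2)) for m in pattern.finditer(log_text))


# Shortened excerpt of the messages quoted above (hypothetical sample input).
log = """\
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3213925248
"""

for sector, hits in sorted(failing_sectors(log).items()):
    print(f"sector {sector}: {hits} error(s), byte offset {sector * SECTOR_SIZE}")
```

Fed the full kernel log instead of the excerpt, the same function would list every distinct sector that failed, which is a useful cross-check before deciding how much of the filesystem might be affected.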
Running btrfsck twice gives the same result, giving a failure with:

Ignoring transid failure
btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec->ino != key->objectid || rec->refs > 1)' failed.

An abridged summary is:

checking extents
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
leaf parent key incorrect 907185135616
bad block 907185135616
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
leaf parent key incorrect 915444883456
bad block 915444883456
leaf parent key incorrect 915445014528
bad block 915445014528
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
leaf parent key incorrect 907183771648
bad block 907183771648
leaf parent key incorrect 907183779840
bad block 907183779840
leaf parent key incorrect 907183783936
bad block 907183783936
[...]
leaf parent key incorrect 907185913856 bad block 907185913856 leaf parent key incorrect 907185917952 bad block 907185917952 parent transid verify failed on 915431579648 wanted 16974 found 16972 parent transid verify failed on 915431579648 wanted 16974 found 16972 parent transid verify failed on 915432382464 wanted 16974 found 16972 parent transid verify failed on 915432382464 wanted 16974 found 16972 parent transid verify failed on 915444707328 wanted 16974 found 13021 parent transid verify failed on 915444707328 wanted 16974 found 13021 parent transid verify failed on 915445092352 wanted 16974 found 13021 parent transid verify failed on 915445092352 wanted 16974 found 13021 parent transid verify failed on 915445100544 wanted 16974 found 13021 parent transid verify failed on 915445100544 wanted 16974 found 13021 parent transid verify failed on 915432734720 wanted 16974 found 16972 parent transid verify failed on 915432734720 wanted 16974 found 16972 parent transid verify failed on 915433144320 wanted 16974 found 16972 parent transid verify failed on 915433144320 wanted 16974 found 16972 parent transid verify failed on 915431862272 wanted 16974 found 16972 parent transid verify failed on 915431862272 wanted 16974 found 16972 parent transid verify failed on 915444715520 wanted 16974 found 13021 parent transid verify failed on 915444715520 wanted 16974 found 13021 parent transid verify failed on 915445166080 wanted 16974 found 13021 parent transid verify failed on 915445166080 wanted 16974 found 13021 parent transid verify failed on 915444740096 wanted 16974 found 13021 parent transid verify failed on 915444740096 wanted 16974 found 13021 bad block 915431026688 leaf parent key incorrect 915431141376 bad block 915431141376 leaf parent key incorrect 915431161856 [...] 
leaf parent key incorrect 915445100544 bad block 915445100544 leaf parent key incorrect 915445166080 bad block 915445166080 leaf parent key incorrect 915445268480 bad block 915445268480 parent transid verify failed on 915444973568 wanted 16974 found 13021 parent transid verify failed on 915444973568 wanted 16974 found 13021 parent transid verify failed on 915444977664 wanted 16974 found 13021 parent transid verify failed on 915444977664 wanted 16974 found 13021 parent transid verify failed on 915444981760 wanted 16974 found 13021 parent transid verify failed on 915444981760 wanted 16974 found 13021 parent transid verify failed on 915432701952 wanted 16974 found 16972 parent transid verify failed on 915432701952 wanted 16974 found 16972 parent transid verify failed on 915444678656 wanted 16974 found 13021 parent transid verify failed on 915444678656 wanted 16974 found 13021 parent transid verify failed on 915444682752 wanted 16974 found 13021 parent transid verify failed on 915444682752 wanted 16974 found 13021 ref mismatch on [712708972544 4096] extent item 0, found 1 Backref 712708972544 parent 5 root 5 not found in extent tree backpointer mismatch on [712708972544 4096] ref mismatch on [712708988928 4096] extent item 0, found 1 Backref 712708988928 parent 5 root 5 not found in extent tree backpointer mismatch on [712708988928 4096] ref mismatch on [712708993024 4096] extent item 0, found 1 Backref 712708993024 parent 5 root 5 not found in extent tree backpointer mismatch on [712708993024 4096] ref mismatch on [712708997120 4096] extent item 0, found 1 Backref 712708997120 parent 5 root 5 not found in extent tree backpointer mismatch on [712708997120 4096] ref mismatch on [712709001216 4096] extent item 0, found 1 Backref 712709001216 parent 5 root 5 not found in extent tree backpointer mismatch on [712709001216 4096] [...] 
ref mismatch on [712709062656 4096] extent item 0, found 1 Backref 712709062656 parent 5 root 5 not found in extent tree backpointer mismatch on [712709062656 4096] ref mismatch on [712709066752 4096] extent item 0, found 1 Backref 712709066752 parent 5 root 5 not found in extent tree backpointer mismatch on [712709066752 4096] ref mismatch on [907178082304 4096] extent item 1, found 0 Backref 907178082304 root 5 not referenced back 0x1b96f2a0 Incorrect global backref count on 907178082304 found 1 wanted 0 backpointer mismatch on [907178082304 4096] owner ref check failed [907178082304 4096] ref mismatch on [907178090496 4096] extent item 1, found 0 Backref 907178090496 root 5 not referenced back 0x1b98aed0 Incorrect global backref count on 907178090496 found 1 wanted 0 backpointer mismatch on [907178090496 4096] owner ref check failed [907178090496 4096] ref mismatch on [907178156032 4096] extent item 1, found 0 Backref 907178156032 root 5 not referenced back 0x3ffe5ce0 Incorrect global backref count on 907178156032 found 1 wanted 0 backpointer mismatch on [907178156032 4096] owner ref check failed [907178156032 4096] ref mismatch on [907178160128 4096] extent item 1, found 0 Backref 907178160128 root 5 not referenced back 0x5fbf8b0 Incorrect global backref count on 907178160128 found 1 wanted 0 backpointer mismatch on [907178160128 4096] owner ref check failed [907178160128 4096] [...] 
ref mismatch on [907180011520 4096] extent item 1, found 0 Backref 907180011520 root 5 not referenced back 0x5980c7e0 Incorrect global backref count on 907180011520 found 1 wanted 0 backpointer mismatch on [907180011520 4096] owner ref check failed [907180011520 4096] owner ref check failed [907183771648 4096] owner ref check failed [907183779840 4096] owner ref check failed [907183783936 4096] owner ref check failed [907183792128 4096] owner ref check failed [907183796224 4096] owner ref check failed [907183841280 4096] owner ref check failed [907183874048 4096] owner ref check failed [907183878144 4096] owner ref check failed [907183882240 4096] owner ref check failed [907183886336 4096] owner ref check failed [907183894528 4096] owner ref check failed [907183898624 4096] owner ref check failed [907183902720 4096] owner ref check failed [907183906816 4096] owner ref check failed [907183910912 4096] owner ref check failed [907185057792 4096] owner ref check failed [907185082368 4096] owner ref check failed [907185135616 4096] ref mismatch on [907185139712 4096] extent item 1, found 0 Backref 907185139712 root 5 not referenced back 0x470fa690 Incorrect global backref count on 907185139712 found 1 wanted 0 backpointer mismatch on [907185139712 4096] owner ref check failed [907185139712 4096] [...] 
ref mismatch on [934316011520 4096] extent item 0, found 1 Backref 934316011520 parent 5 root 5 not found in extent tree backpointer mismatch on [934316011520 4096] ref mismatch on [934316019712 4096] extent item 0, found 1 Backref 934316019712 parent 5 root 5 not found in extent tree backpointer mismatch on [934316019712 4096] ref mismatch on [934316032000 4096] extent item 0, found 1 Backref 934316032000 parent 5 root 5 not found in extent tree backpointer mismatch on [934316032000 4096] ref mismatch on [1128365600768 8192] extent item 1, found 0 Incorrect local backref count on 1128365600768 root 5 owner 889187 offset 0 found 0 wanted 1 back 0x6bb76d90 Backref disk bytenr does not match extent record, bytenr=1128365600768, ref bytenr=17613768628740554752 backpointer mismatch on [1128365600768 8192] owner ref check failed [1128365600768 8192] ref mismatch on [1128365608960 8192] extent item 1, found 0 Incorrect local backref count on 1128365608960 root 5 owner 889188 offset 0 found 0 wanted 1 back 0x6bb76ec0 Backref disk bytenr does not match extent record, bytenr=1128365608960, ref bytenr=8848955218968205284 backpointer mismatch on [1128365608960 8192] owner ref check failed [1128365608960 8192] ref mismatch on [1128365617152 8192] extent item 1, found 0 Incorrect local backref count on 1128365617152 root 5 owner 889189 offset 0 found 0 wanted 1 back 0x6bb76ff0 Backref disk bytenr does not match extent record, bytenr=1128365617152, ref bytenr=1928784803178016523 backpointer mismatch on [1128365617152 8192] owner ref check failed [1128365617152 8192] ref mismatch on [1128365625344 4096] extent item 1, found 0 Incorrect local backref count on 1128365625344 root 5 owner 889190 offset 0 found 0 wanted 1 back 0x6bb77120 Backref disk bytenr does not match extent record, bytenr=1128365625344, ref bytenr=3735616339648328182 backpointer mismatch on [1128365625344 4096] owner ref check failed [1128365625344 4096] [...] 
ref mismatch on [1454133166080 12288] extent item 1, found 0 Incorrect local backref count on 1454133166080 root 5 owner 2096965 offset 0 found 0 wanted 1 back 0x50c68ad0 Backref disk bytenr does not match extent record, bytenr=1454133166080, ref bytenr=64 backpointer mismatch on [1454133166080 12288] owner ref check failed [1454133166080 12288] Errors found in extent allocation tree or chunk allocation checking free space cache Checking filesystem on /dev/sdc UUID: 38a60270-f9c6-4ed4-8421-4bf1253ae0b3 free space inode generation (0) did not match free space cache generation (505) free space inode generation (0) did not match free space cache generation (486) free space inode generation (0) did not match free space cache generation (486) free space inode generation (0) did not match free space cache generation (486) free space inode generation (0) did not match free space cache generation (484) free space inode generation (0) did not match free space cache generation (486) free space inode generation (0) did not match free space cache generation (486) free space inode generation (0) did not match free space cache generation (484) free space inode generation (0) did not match free space cache generation (486) free space inode generation (0) did not match free space cache generation (484) free space inode generation (0) did not match free space cache generation (516) free space inode generation (0) did not match free space cache generation (486) free space inode generation (0) did not match free space cache generation (487) free space inode generation (0) did not match free space cache generation (486) free space inode generation (0) did not match free space cache generation (501) free space inode generation (0) did not match free space cache generation (531) free space inode generation (0) did not match free space cache generation (498) free space inode generation (0) did not match free space cache generation (498) free space inode generation (0) did not match free 
space cache generation (484) free space inode generation (0) did not match free space cache generation (532) free space inode generation (0) did not match free space cache generation (502) free space inode generation (0) did not match free space cache generation (532) free space inode generation (0) did not match free space cache generation (502) [...] free space inode generation (0) did not match free space cache generation (1612) free space inode generation (0) did not match free space cache generation (1612) free space inode generation (0) did not match free space cache generation (1613) free space inode generation (0) did not match free space cache generation (1599) free space inode generation (0) did not match free space cache generation (1606) free space inode generation (0) dparent transid verify failed on 907185127424 wanted 15935 found 12264 parent transid verify failed on 907185127424 wanted 15935 found 12264 parent transid verify failed on 915431579648 wanted 16974 found 16972 parent transid verify failed on 915431579648 wanted 16974 found 16972 parent transid verify failed on 915432382464 wanted 16974 found 16972 parent transid verify failed on 915432382464 wanted 16974 found 16972 parent transid verify failed on 915444707328 wanted 16974 found 13021 parent transid verify failed on 915444707328 wanted 16974 found 13021 parent transid verify failed on 915445092352 wanted 16974 found 13021 parent transid verify failed on 915445092352 wanted 16974 found 13021 parent transid verify failed on 915445100544 wanted 16974 found 13021 parent transid verify failed on 915445100544 wanted 16974 found 13021 parent transid verify failed on 915432734720 wanted 16974 found 16972 parent transid verify failed on 915432734720 wanted 16974 found 16972 parent transid verify failed on 915433144320 wanted 16974 found 16972 parent transid verify failed on 915433144320 wanted 16974 found 16972 parent transid verify failed on 915431862272 wanted 16974 found 16972 parent transid 
verify failed on 915431862272 wanted 16974 found 16972 parent transid verify failed on 915444715520 wanted 16974 found 13021 parent transid verify failed on 915444715520 wanted 16974 found 13021 parent transid verify failed on 915445166080 wanted 16974 found 13021 parent transid verify failed on 915445166080 wanted 16974 found 13021 parent transid verify failed on 915444740096 wanted 16974 found 13021 parent transid verify failed on 915444740096 wanted 16974 found 13021 parent transid verify failed on 915444973568 wanted 16974 found 13021 parent transid verify failed on 915444973568 wanted 16974 found 13021 parent transid verify failed on 915444977664 wanted 16974 found 13021 parent transid verify failed on 915444977664 wanted 16974 found 13021 parent transid verify failed on 915444981760 wanted 16974 found 13021 parent transid verify failed on 915444981760 wanted 16974 found 13021 parent transid verify failed on 915432701952 wanted 16974 found 16972 parent transid verify failed on 915432701952 wanted 16974 found 16972 parent transid verify failed on 915444678656 wanted 16974 found 13021 parent transid verify failed on 915444678656 wanted 16974 found 13021 parent transid verify failed on 915444682752 wanted 16974 found 13021 parent transid verify failed on 915444682752 wanted 16974 found 13021 checking fs roots parent transid verify failed on 907185082368 wanted 15935 found 12264 parent transid verify failed on 907185082368 wanted 15935 found 12264 parent transid verify failed on 907185082368 wanted 15935 found 12264 parent transid verify failed on 907185082368 wanted 15935 found 12264 parent transid verify failed on 907185127424 wanted 15935 found 12264 parent transid verify failed on 907185127424 wanted 15935 found 12264 parent transid verify failed on 907185127424 wanted 15935 found 12264 parent transid verify failed on 907185127424 wanted 15935 found 12264 [...] 
parent transid verify failed on 915444523008 wanted 16974 found 13021
parent transid verify failed on 915444523008 wanted 16974 found 13021
parent transid verify failed on 915444523008 wanted 16974 found 13021
parent transid verify failed on 915444523008 wanted 16974 found 13021
Ignoring transid failure
btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec->ino != key->objectid || rec->refs > 1)' failed.
id not match free space cache generation (1625)
free space inode generation (0) did not match free space cache generation (1607)
free space inode generation (0) did not match free space cache generation (1604)
free space inode generation (0) did not match free space cache generation (1606)
free space inode generation (0) did not match free space cache generation (1620)
free space inode generation (0) did not match free space cache generation (1626)
free space inode generation (0) did not match free space cache generation (1609)
free space inode generation (0) did not match free space cache generation (1653)
free space inode generation (0) did not match free space cache generation (1628)
free space inode generation (0) did not match free space cache generation (1628)
free space inode generation (0) did not match free space cache generation (1649)

End of output

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
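The btrfsck output above repeats a small number of distinct "wanted/found" generation pairs many times over. As a rough aid to reading it, here is a minimal Python sketch (the function name `group_transid_failures` and the short `sample` string are illustrative assumptions, not anything from btrfs-progs) that groups the failures by pair and shows how far behind the on-disk block is in each case.

```python
import re
from collections import defaultdict

TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def group_transid_failures(output):
    """Map each (wanted, found) generation pair to the set of block addresses reporting it."""
    groups = defaultdict(set)
    for bytenr, wanted, found in TRANSID_RE.findall(output):
        groups[(int(wanted), int(found))].add(int(bytenr))
    return dict(groups)


# A few messages copied from the output above (hypothetical, shortened sample).
sample = (
    "parent transid verify failed on 907185082368 wanted 15935 found 12264 "
    "parent transid verify failed on 907185082368 wanted 15935 found 12264 "
    "parent transid verify failed on 915444707328 wanted 16974 found 13021 "
    "parent transid verify failed on 915431579648 wanted 16974 found 16972 "
)

for (wanted, found), blocks in sorted(group_transid_failures(sample).items()):
    print(
        f"wanted {wanted}, found {found} (behind by {wanted - found}): "
        f"{len(blocks)} distinct block(s)"
    )
```

In the abridged output only three pairs appear: blocks two generations behind (16974 vs 16972) and blocks several thousand generations behind (15935 vs 12264, 16974 vs 13021). What that distinction implies for recovery is not established in this thread.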
Chris Murphy
2013-Sep-28 20:51 UTC
Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)
On Sep 28, 2013, at 1:26 PM, Martin <m_btrfs@ml1.co.uk> wrote:

> Writing data via rsync at the 6Gbit/s sata rate caused
> IO errors for just THREE sectors...
>
> Yet btrfsck bombs out with LOTs of errors…

Any fs will bomb out on write errors.

> How best to recover from this?

Why you're getting I/O errors at SATA 6Gbps link speed needs to be understood. Is it a bad cable? Bad SATA port? Drive or controller firmware bug? Or libata driver bug?

> Lots of sata error noise omitted.

And entire dmesg might still be useful. I don't know if the list will handle the whole dmesg in one email, but it's worth a shot (reply to an email in the thread, don't change the subject). It's possible software or hardware problems are detected well before writes are even initiated.

> Running "badblocks" twice in succession (non-destructive data test!)
> shows no surface errors and no further errors on the sata interface.

SATA link speed related errors aren't related to bad blocks. If you do a smartctl -x on the drive, chances are it's recording PHY Event errors that might be relevant, and also SMART might record UDMA/CRC errors that would just corroborate that the drive also found link errors.

> Running btrfsck twice gives the same result, giving a failure with:

Well honestly at this point I expect file system corruption as it's entirely possible that before the hardware dropped the link speed down to SATA 3Gbps, there was corrupt data already sent to the drive and that's not something Btrfs can know about until trying to read the data back in. So *shrug* - I don't see Btrfs as a way to totally mitigate hardware problems. It's the same problem with bad RAM, and Btrfs doesn't like that either.

Chris Murphy
Martin
2013-Sep-28 22:51 UTC
Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)
Chris,

All agreed. Further comment inlined:

(Should have mentioned more prominently that the hardware problem has been worked around by limiting the sata to 3Gbit/s on bootup.)

On 28/09/13 21:51, Chris Murphy wrote:
>
> On Sep 28, 2013, at 1:26 PM, Martin <m_btrfs@ml1.co.uk> wrote:
>
>> Writing data via rsync at the 6Gbit/s sata rate caused IO errors
>> for just THREE sectors...
>>
>> Yet btrfsck bombs out with LOTs of errors…
>
> Any fs will bomb out on write errors.

Indeed. However, are not the sata errors reported back to btrfs so that it knows whatever parts haven't been updated?

Is there not a mechanism to then go "read-only"?

Also, should not the journal limit the damage?

>> How best to recover from this?
>
> Why you're getting I/O errors at SATA 6Gbps link speed needs to be
> understood. Is it a bad cable? Bad SATA port? Drive or controller
> firmware bug? Or libata driver bug?

I systematically eliminated such things as leads, PSU, and NCQ. Limiting libata to only use 3Gbit/s is the one change that gives a consistent fix. The HDD and motherboard both support 6Gbit/s, but hey-ho, that's an experiment I can try again some other time when I have another HDD/SSD to test in there.

In any case, for the existing HDD - motherboard combination, using sata2 rather than sata3 speeds shouldn't noticeably impact performance. (Other than sata2 works reliably and so is infinitely better for this case!)

>> Lots of sata error noise omitted.
>
> And entire dmesg might still be useful. I don't know if the list will
> handle the whole dmesg in one email, but it's worth a shot (reply to
> an email in the thread, don't change the subject).

I can email directly if of use/interest. Let me know offlist.

> do a smartctl -x on the drive, chances are it's recording PHY Event

(smartctl -x errors shown further down...)
Nothing untoward noticed:

# smartctl -a /dev/sdc

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-...
LU WWN Device Id: ...
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Sep 28 23:35:57 2013 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       9
  3 Spin_Up_Time            0x0027   253   159   021    Pre-fail  Always       -       1983
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       55
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       800
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       53
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       31
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3115
194 Temperature_Celsius     0x0022   118   110   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

# smartctl -x /dev/sdc ...
also shows the errors it saw:

(Just the last 4 copied which look timed for when the HDD was last exposed to 6Gbit/s sata)

Error 46 [21] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
01 -- 51 00 08 00 00 6c 1a 4b b0 e0 00  Error: AMNF 8 sectors at LBA = 0x6c1a4bb0 = 1813662640

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
-- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
25 00 00 00 08 00 00 6c 1a 4b b0 e0 08  10:51:07.192  READ DMA EXT
35 00 00 00 08 00 00 6c 1a 4b a8 e0 08  10:51:07.192  WRITE DMA EXT
25 00 00 00 08 00 00 6c 1a 4b a8 e0 08  10:51:07.192  READ DMA EXT
35 00 00 00 08 00 00 6c 1a 4b a8 e0 08  10:51:07.192  WRITE DMA EXT
25 00 00 00 08 00 00 6c 1a 4b a8 e0 08  10:51:07.157  READ DMA EXT

Error 45 [20] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
-- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
25 00 00 04 00 00 00 6c 1a 48 00 e0 08  10:51:03.450  READ DMA EXT
ef 00 10 00 02 00 00 00 00 00 00 a0 08  10:51:03.449  SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08  10:51:03.449  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08  10:51:03.446  IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08  10:51:03.446  SET FEATURES [Set transfer mode]

Error 44 [19] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
-- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
25 00 00 04 00 00 00 6c 1a 48 00 e0 08  10:51:00.453  READ DMA EXT
ef 00 10 00 02 00 00 00 00 00 00 a0 08  10:51:00.452  SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08  10:51:00.452  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08  10:51:00.449  IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08  10:51:00.449  SET FEATURES [Set transfer mode]

Error 43 [18] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
-- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
25 00 00 04 00 00 00 6c 1a 48 00 e0 08  10:50:57.455  READ DMA EXT
ef 00 10 00 02 00 00 00 00 00 00 a0 08  10:50:57.455  SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08  10:50:57.455  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08  10:50:57.452  IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08  10:50:57.452  SET FEATURES [Set transfer mode]

Error 42 [17] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
-- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
25 00 00 04 00 00 00 6c 1a 48 00 e0 08  10:50:54.459  READ DMA EXT
ef 00 10 00 02 00 00 00 00 00 00 a0 08  10:50:54.458  SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08  10:50:54.458  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08  10:50:54.455  IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08  10:50:54.455  SET FEATURES [Set transfer mode]

>> Running btrfsck twice gives the same result, giving a failure
>> with:
>
> Well honestly at this point I expect file system corruption as it's
> entirely possible that
> before the hardware dropped the link speed
> down to SATA 3Gbps, there was corrupt data already sent to the drive
> and that's not something Btrfs can know about until trying to read
> the data back in. So *shrug* - I don't see Btrfs as a way to totally
> mitigate hardware problems. It's the same problem with bad RAM, and
> Btrfs doesn't like that either.

Indeed. Hence trapping 'unexpectedness' where reasonable to then go read-only... (I guess a hard compromise though whilst still debugging bugs 'unexpectedness'! But still good to have in mind. ;-) )

Regards,
Martin
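For readers following the smartctl -x error log earlier in this message, the sector count and LBA that smartctl prints after "Error: AMNF" can be recovered from the raw register bytes. The sketch below is a minimal Python illustration, assuming the column layout given by the log's own header row (ER, a "--" placeholder, ST, a 2-byte COUNT, a 6-byte LBA_48, then DV and DC); the function name `decode_error_registers` is hypothetical.

```python
def decode_error_registers(line):
    """Decode (sector count, 48-bit LBA) from one extended error-log register row.

    Assumed layout, taken from the column header in the log:
    ER -- ST COUNT(2 bytes) LBA_48(6 bytes) DV DC
    """
    fields = line.split()
    # fields[1] is the literal "--" placeholder column, so COUNT starts at index 3.
    count = int("".join(fields[3:5]), 16)
    lba = int("".join(fields[5:11]), 16)
    return count, lba


count, lba = decode_error_registers("01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00")
print(f"{count} sectors at LBA 0x{lba:x} = {lba}")
```

For the row shown, this reproduces the values smartctl itself reports (1024 sectors at LBA 0x6c1a4bb0 = 1813662640), which confirms that the four 1024-sector errors and the final 8-sector error all point at the same LBA.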
Martin
2013-Sep-28 22:54 UTC
Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)
On 28/09/13 20:26, Martin wrote:

> ... btrfsck bombs out with LOTs of errors...
>
> How best to recover from this?
>
> (This is a 'backup' disk so not 'critical' but it would be nice to avoid
> rewriting about 1.5TB of data over the network...)
>
>
> Is there an obvious sequence/recipe to follow for recovery?

I've got the drive reliably working with the sata limited to 3Gbit/s.

What is the best sequence to try to tidy-up and carry on with the 1.5TB or so of data on there, rather than working from scratch?

So far, I've only run btrfsck since the corruption errors for the three sectors...

Suggestions for recovery?

Thanks,
Martin
Chris Murphy
2013-Sep-29 02:06 UTC
Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)
On Sep 28, 2013, at 4:51 PM, Martin <m_btrfs@ml1.co.uk> wrote:

> Indeed. However, are not the sata errors reported back to btrfs so that
> it knows whatever parts haven't been updated?

It's a good question. My doubtful speculation of such a mechanism is that it is really not the responsibility of the file system to be prepared for the hardware face planting this spectacularly. The hardware really should do better than this. There are specifications that apply here, and the drive and controller and driver all agreed long before the mounting of a volume and writes started to occur. But then later on, at some point in the middle of the really important part of the conversation (writing your data) something in the hardware chain puked and said "OHHh wait about that prior conversation, I'm really confused, let's talk at a slower speed shall we?" So the before part is just a lost conversation, is my speculation.

The other thing is that SATA and SAS handle these things differently. When there's such a serious error that results in a link speed change, usually the bus is reset and for SATA it means the command queue is lost. And I don't think Btrfs is informed of what commands were completed vs failed in such a case. But I'd love someone who actually knows what they're talking about to answer that question.

My expectation though, is that unlike perhaps other file systems, Btrfs's design goal is to handle the data that did get written, better. In that it's still accessible where other file systems possibly will have more difficulty.

> Is there not a mechanism to then go "read-only"?

I don't know. In this case it does seem sorta reasonable. But the dmesg might still be revealing.
The PHY Event counters indicate a lot of retries of over 1000 sectors.> > Also, should not the journal limit the damage?Well it''s COW so it''s not quite like a journaled file system, but yeah it should be in a position to know at the next mount time the most recent state of file system consistency. But that doesn''t mean it can fix the parts that are just fundamentally broken. But I think it''s a valid question, "now what?" because I don''t actually know the state of your file system or how to determine it. So maybe Hugo, or someone else has some thoughts. But for sure I would move to kernel 3.11.2 or 3.12.rc2 before mounting this file system again.> > >>> How best to recover from this? >> >> Why you''re getting I/O errors at SATA 6Gbps link speed needs to be >> understood. Is it a bad cable? Bad SATA port? Drive or controller >> firmware bug? Or libata driver bug? > > I systematically eliminated such as leads, PSU, and NCQ. Limiting libata > to only use 3Gbit/s is the one change that gives a consistent fix. The > HDD and motherboard both support 6Gbit/s, but hey-ho, that''s an > experiment I can try again some other time when I have another HDD/SSD > to test in there.Stick with forced 3Gbps, but I think it''s worth while to find out what the actual problem is. One day you forget about this 3Gbps SATA link, upgrade or regress to another kernel and you don''t have the 3Gbps forced speed on the parameter line, and poof - you''ve got more problems again. The hardware shouldn''t negotiate a 6Gbps link and then do a backwards swan dive at 30,000'' with your data as if it''s an after thought.> In any case, for the existing HDD - motherboard combination, using sata2 > rather than sata3 speeds shouldn''t noticeably impact performance. (Other > than sata2 works reliably and so is infinitely better for this case!)It''s true.> > >>> Lots of sata error noise omitted. >> >> And entire dmesg might still be useful. 
I don''t know if the list will >> handle the whole dmesg in one email, but it''s worth a shot (reply to >> an email in the thread, don''t change the subject). > > I can email directly if of use/interest. Let me know offlist.Use pastebin.com and post the link if it''s really huge, but I''d consider setting it to no expiration because if something interesting is learned, people doing searches have a better chance of finding the problem if the link hasn''t expired. I would also separately unmount the file system, note the latest kernel message, then mount the file system and see if there are any kernel messages that might indicate recognition of problems with the fs. I would not use btrfsck --repair until someone says it''s a good idea. That person would not be me. Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
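A note on Chris's worry above that the forced 3Gbps limit could silently disappear after a kernel change: one way to guard against that might be a small check on the kernel log. This is only a sketch. It assumes libata reports the negotiated speed in lines of the form "ataN: SATA link up 3.0 Gbps (...)", and the function names sata_link_speeds and check_sata_limit are made up for illustration.

```shell
#!/bin/sh
# Sketch: list negotiated SATA link speeds found in kernel log text on stdin.
# Assumes libata lines of the form "ata3: SATA link up 3.0 Gbps (SStatus ...)".
sata_link_speeds() {
    sed -n 's/.*\(ata[0-9][0-9.]*\): SATA link up \([0-9.]*\) Gbps.*/\1 \2/p'
}

# Exit non-zero (and print a warning) if any link negotiated faster than
# the given limit, e.g. 3.0 when libata.force=3.0G is expected to be active.
check_sata_limit() {
    limit=$1
    sata_link_speeds | awk -v limit="$limit" '
        $2 + 0 > limit + 0 { print "WARNING: " $1 " is at " $2 " Gbps"; bad = 1 }
        END { exit bad }'
}
```

Something like `dmesg | check_sata_limit 3.0` run at boot, before the backup disk is mounted, would then flag a link that has come up at 6.0 Gbps again.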
Martin
2013-Sep-29 02:10 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
On 28/09/13 23:54, Martin wrote:
> On 28/09/13 20:26, Martin wrote:
>
>> ... btrfsck bombs out with LOTs of errors...
>>
>> How best to recover from this?
>>
>> (This is a 'backup' disk so not 'critical' but it would be nice to avoid
>> rewriting about 1.5TB of data over the network...)
>>
>> Is there an obvious sequence/recipe to follow for recovery?
>
> I've got the drive reliably working with the sata limited to 3Gbit/s.
> What is the best sequence to try to tidy-up and carry on with the 1.5TB
> or so of data on there, rather than working from scratch?
>
> So far, I've only run btrfsck since the corruption...

So...

Any options for btrfsck to fix things?

Or is anything/everything that is fixable automatically fixed on the next mount?

Or should:

btrfs scrub /dev/sdX

be run first?

Or?

What does btrfs do (or can do) for recovery?

Advice welcomed,

Thanks,
Martin
Martin
2013-Sep-29 02:31 UTC
Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)
Chris,

Thanks for good comment/discussion.

On 29/09/13 03:06, Chris Murphy wrote:
>
> On Sep 28, 2013, at 4:51 PM, Martin <m_btrfs@ml1.co.uk> wrote:
>
> Stick with forced 3Gbps, but I think it's worth while to find out
> what the actual problem is. One day you forget about this 3Gbps SATA
> link, upgrade or regress to another kernel and you don't have the
> 3Gbps forced speed on the parameter line, and poof - you've got more
> problems again. The hardware shouldn't negotiate a 6Gbps link and
> then do a backwards swan dive at 30,000' with your data as if it's an
> after thought.

I've got an engineer's curiosity so that one is very definitely marked for revisiting at some time... If only to blog that x-y-z combination is a tar pit for your data...

>> In any case, for the existing HDD - motherboard combination, using
>> sata2 rather than sata3 speeds shouldn't noticeably impact
>> performance. (Other than sata2 works reliably and so is infinitely
>> better for this case!)
>
> It's true.

Well, the IO data rate for badblocks is exactly the same as before, limited by the speed of the physical rust spinning and data density...

> I would also separately unmount the file system, note the latest
> kernel message, then mount the file system and see if there are any
> kernel messages that might indicate recognition of problems with the
> fs.
>
> I would not use btrfsck --repair until someone says it's a good idea.
> That person would not be me.

It is sat unmounted until some informed opinion is gained...

Thanks again for your notes,

Regards,
Martin
Duncan
2013-Sep-29 05:11 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
Martin posted on Sun, 29 Sep 2013 03:10:37 +0100 as excerpted:

> So...
>
> Any options for btrfsck to fix things?
>
> Or is anything/everything that is fixable automatically fixed on the
> next mount?
>
> Or should:
>
> btrfs scrub /dev/sdX
>
> be run first?
>
> Or?
>
> What does btrfs do (or can do) for recovery?

Here's a general-case answer (courtesy gmane) to the question of the order in which to try recovery, which Hugo posted a few weeks ago:

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999

Note that in specific cases someone who knew what they were doing could omit some steps and focus on others, but I'm not at that level of "know what I'm doing", so...

Scrub... would go before this, if it's useful. But scrub depends on a second, valid copy being available in order to fix the bad-checksum one. On a single-device btrfs, btrfs defaults to DUP metadata (unless it's SSD), so you may have a second copy for that, but you won't have a second copy of the data. This is a very strong reason to go btrfs raid1 mode (for both data and metadata) if you can, because that gives you a second copy of everything, thereby actually making use of btrfs' checksum and scrub ability. (Unfortunately, there is as yet no way to do N-way mirroring; there's only the second copy, not a third, no matter how many devices you have in that "raid1".)

Finally, if you mentioned your kernel (and btrfs-tools) version(s) I missed it, but [boilerplate recommendation, stressed repeatedly both in the wiki and on-list] with btrfs still labeled experimental and under serious development, there are still lots of bugs fixed every kernel release. So as Chris Murphy said, if you're not on 3.11-stable or 3.12-rcX already, get there. Not only can the safety of your data depend on it, but by choosing to run experimental we're all testers, and our reports if something does go wrong will be far more usable if we're on a current kernel.

Similarly, btrfs-tools 0.20-rc1 is already somewhat old; you really should be on a git-snapshot beyond that. (The master branch is kept stable; work is done in other branches and only merged to master when it's considered suitably stable, so a recently updated btrfs-tools master HEAD is at least in theory always the best possible version you can be running. If that's ever NOT the case, then testers need to be reporting that ASAP so it can be fixed, too.)

Back to the kernel, it's worth noting that 3.12-rcX includes an option that turns off most btrfs bugons by default. Unless you're a btrfs developer (which it doesn't sound like you are), you'll want to activate that (turning off the bugons), as they're not helpful for ordinary users and just force unnecessary reboots when something minor and otherwise immediately recoverable goes wrong. That's just one of the latest fixes.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
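Whether scrub has a second metadata copy to repair from, as described above, depends on the metadata profile actually in use. A rough way to check might be to read it out of "btrfs filesystem df" for the mounted filesystem. The sketch below is only that: it assumes output lines of the form "Metadata, DUP: total=..., used=..." (with a bare "Metadata: ..." meaning a single copy), and the function name metadata_profile is made up here. Note also that scrub is normally started with "btrfs scrub start" on a mounted filesystem rather than with a bare "btrfs scrub /dev/sdX".

```shell
#!/bin/sh
# Sketch: print the metadata profile from `btrfs filesystem df` output on stdin.
# Assumed line formats: "Metadata, DUP: total=1.00GB, used=24.00KB"
#                   or  "Metadata: total=1.00GB, used=24.00KB" (single copy).
metadata_profile() {
    awk -F'[,:]' '/^Metadata/ {
        if ($0 ~ /^Metadata,/) { gsub(/ /, "", $2); print $2 } else { print "single" }
        exit
    }'
}
```

Used as `btrfs filesystem df /mnt/bu_A | metadata_profile`, a result of DUP or RAID1 would suggest that scrub, or an ordinary read, has a second metadata copy to fall back on; "single" would mean it does not.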
Martin
2013-Sep-29 21:29 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
On 29/09/13 06:11, Duncan wrote:
> Martin posted on Sun, 29 Sep 2013 03:10:37 +0100 as excerpted:
>
>> So...
>>
>> Any options for btrfsck to fix things?
>>
>> Or is anything/everything that is fixable automatically fixed on the
>> next mount?
>>
>> Or should:
>>
>> btrfs scrub /dev/sdX
>>
>> be run first?
>>
>> Or?
>>
>> What does btrfs do (or can do) for recovery?
>
> Here's a general-case answer (courtesy gmane) to the order in which to
> try recovery question, that Hugo posted a few weeks ago:
>
> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999

Thanks for that. Very well found!

The instructions from Hugo are:

####
Let's assume that you don't have a physical device failure (which is a different set of tools -- mount -odegraded, btrfs dev del missing).

First thing to do is to take a btrfs-image -c9 -t4 of the filesystem, and keep a copy of the output to show josef. :)

Then start with -orecovery and -oro,recovery for pretty much anything.

If those fail, then look in dmesg for errors relating to the log tree -- if that's corrupt and can't be read (or causes a crash), use btrfs-zero-log.

If there's problems with the chunk tree -- the only one I've seen recently was reporting something like "can't map address" -- then chunk-recover may be of use.

After that, btrfsck is probably the next thing to try. If options -s1, -s2, -s3 have any success, then btrfs-select-super will help by replacing the superblock with one that works. If that's not going to be useful, fall back to btrfsck --repair.

Finally, btrfsck --repair --init-extent-tree may be necessary if there's a damaged extent tree. Finally, if you've got corruption in the checksums, there's --init-csum-tree.

Hugo.
####

Those will be tried next...

> Note that in specific cases someone who knew what they were doing could
> omit some steps and focus on others, but I'm not at that level of "know
> what I'm doing", so...
>
> Scrub... would go before this, if it's useful. But scrub depends on a
> second, valid copy being available in order to fix the bad-checksum
> one. On a single device btrfs, btrfs defaults to DUP metadata (unless
> it's SSD), so you may have a second copy for that, but you won't have a
> second copy of the data. This is a very strong reason to go btrfs raid1
> mode (for both data and metadata) if you can, because that gives you a
> second copy of everything, thereby actually making use of btrfs' checksum
> and scrub ability. (Unfortunately, there is as yet no way to do N-way
> mirroring, there's only the second copy not a third, no matter how many
> devices you have in that "raid1".)
>
> Finally, if you mentioned your kernel (and btrfs-tools) version(s) I
> missed it, but [boilerplate recommendation, stressed repeatedly both in
> the wiki and on-list] btrfs being still labeled experimental and under
> serious development, there's still lots of bugs fixed every kernel
> release. So as Chris Murphy said, if you're not on 3.11-stable or
> 3.12-rcX already, get there. Not only can the safety of your data depend on
> it, but by choosing to run experimental we're all testers, and our
> reports if something does go wrong will be far more usable if we're on a
> current kernel. Similarly, btrfs-tools 0.20-rc1 is already somewhat old;
> you really should be on a git-snapshot beyond that. (The master branch
> is kept stable, work is done in other branches and only merged to master
> when it's considered suitably stable, so a recently updated btrfs-tools
> master HEAD is at least in theory always the best possible version you
> can be running. If that's ever NOT the case, then testers need to be
> reporting that ASAP so it can be fixed, too.)
>
> Back to the kernel, it's worth noting that 3.12-rcX includes an option
> that turns off most btrfs bugons by default. Unless you're a btrfs
> developer (which it doesn't sound like you are), you'll want to activate
> that (turning off the bugons), as they're not helpful for ordinary users
> and just force unnecessary reboots when something minor and otherwise
> immediately recoverable goes wrong. That's just one of the latest fixes.

Looking up what's available for Gentoo, the maintainers there look to be nicely sharp with multiple versions available all the way up to kernel 3.11.2...

There's also the latest available from btrfs tools with sys-fs/btrfs-progs "9999"...

OK, so onto the cutting edge to compile them in...

Thanks all,
Martin
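For reference, the order of steps in Hugo's instructions quoted above can be written down as a dry-run checklist. The sketch below only prints text and runs nothing. The tool and option names are taken from Hugo's wording as quoted; the function name recovery_plan, the IMAGE_FILE placeholder and the exact argument layout are assumptions for illustration, and each later step is only meant to be considered if the earlier ones fail.

```shell
#!/bin/sh
# Dry-run sketch: print Hugo's recovery order for a device and mount point.
# Nothing is executed; IMAGE_FILE is a placeholder for wherever the image goes.
recovery_plan() {
    dev=$1
    mnt=$2
    cat <<EOF
1. btrfs-image -c9 -t4 $dev IMAGE_FILE   (keep a metadata image first)
2. mount -o recovery $dev $mnt
3. mount -o ro,recovery $dev $mnt
4. btrfs-zero-log $dev   (only if dmesg shows log tree errors)
5. chunk-recover   (only for chunk tree errors such as "can't map address")
6. btrfsck $dev
7. btrfsck -s1, -s2, -s3; if one works, btrfs-select-super
8. btrfsck --repair $dev
9. btrfsck --repair --init-extent-tree $dev   (damaged extent tree)
10. --init-csum-tree   (corrupt checksums)
EOF
}
```

Calling `recovery_plan /dev/sdc /mnt/bu_A` prints the ten steps in order, which may be handy to keep next to the terminal while working through them one at a time.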
Martin
2013-Sep-29 21:55 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
On 29/09/13 22:29, Martin wrote:

> Looking up what's available for Gentoo, the maintainers there look to be
> nicely sharp with multiple versions available all the way up to kernel
> 3.11.2...

That is being pulled in now as expected:

sys-kernel/gentoo-sources-3.11.2

> There's also the latest available from btrfs tools with
> sys-fs/btrfs-progs "9999"...

Oddly, that caused emerge to report:

[ebuild UD ] sys-fs/btrfs-progs-0.19.11 [0.20_rc1_p358] 0 kB

which is a *downgrade*. Hence, I'm keeping with the 0.20_rc1_p358.

> OK, so onto the cutting edge to compile them in...

Interesting times as is said in a certain part of the world...

Martin
Duncan
2013-Sep-30 07:51 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
Martin posted on Sun, 29 Sep 2013 22:55:43 +0100 as excerpted:

> On 29/09/13 22:29, Martin wrote:
>
>> Looking up what's available for Gentoo, the maintainers there look to
>> be nicely sharp with multiple versions available all the way up to
>> kernel 3.11.2...

Cool, another gentooer! =:^)

> That is being pulled in now as expected:
>
> sys-kernel/gentoo-sources-3.11.2

FWIW, I've been doing my own kernels (mainline) since back on mandrake a decade ago, and I just changed up my scripts a bit when I switched to gentoo. Then later on I changed them up a bit more to be able to run git kernels. These days I normally first try the dev kernel (and switch to it if there are no serious bugs) around -rc2 or so, by which point I figure anything likely to eat my system for breakfast should be either worked thru, or at least news of it available. As a non-dev, it's very cool being able to spot and report bugs, possibly bisecting to a specific commit, and watch them get fixed before general kernel release. Just one way I as a non-dev can contribute back. =:^)

To take care of packages that depend on a kernel package, I used to have a kernel (gentoo-sources-2.6.9999 or some such, back then; now of course it'd be 3.9999) in package.provided, but these days I don't even need that. =:^)

>> There's also the latest available from btrfs tools with
>> sys-fs/btrfs-progs "9999"...
>
> Oddly, that caused emerge to report:
>
> [ebuild UD ] sys-fs/btrfs-progs-0.19.11 [0.20_rc1_p358] 0 kB
>
> which is a *downgrade*. Hence, I'm keeping with the 0.20_rc1_p358.

btrfs-progs-9999 is available, but as a live package, it's masked in keeping with gentoo policy. So to get it you'd need to unmask it.

But 0.20_rc1_p358 shouldn't be /too/ far back. In fact, I'm guessing the p-number is the (serialized) patch sequence number indicating the number of commits forward from the rc1 tag. And on the (locally unmasked) -9999 version here, a git describe --tags gives me v0.20-rc1-358-g194aa4a ... so 0.20_rc1_p358 is very likely identical to the live version at this point, and it makes no difference, except that the non-live version is a stable snapshot instead of a version that might change every time you merge it, if upstream has done any further commits.

So btrfs-progs-0.20_rc1_p358 should be fine. And you were updating kernel to 3.11.2, so that should be fine by the time you read this, as well. =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
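Duncan's guess above about how the ebuild version relates to the git describe output can be illustrated with a small conversion. This is only a sketch of the mapping as he describes it (commits-since-tag becoming the _pNNN suffix), not a statement of how the Gentoo ebuild is actually generated, and the function name describe_to_ebuild_version is made up here.

```shell
#!/bin/sh
# Sketch: turn `git describe --tags` output such as v0.20-rc1-358-g194aa4a
# into the ebuild-style version 0.20_rc1_p358 (tag, rc level, commit count).
# Prints nothing if the input does not have the expected shape.
describe_to_ebuild_version() {
    printf '%s\n' "$1" |
        sed -n 's/^v\([0-9.]*\)-\(rc[0-9]*\)-\([0-9]*\)-g[0-9a-f]*$/\1_\2_p\3/p'
}
```

With the string from the thread, `describe_to_ebuild_version v0.20-rc1-358-g194aa4a` gives 0.20_rc1_p358, which matches the installed package version Martin reported.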
Martin
2013-Oct-03 00:49 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
So... The fix:

(
Summary:

Mounting "-o recovery,noatime" worked well and allowed a diff check to complete for all but one directory tree. So very nearly all the data is fine.

Deleting the failed directory tree caused a call stack dump and eventually:

kernel: parent transid verify failed on 915444822016 wanted 16974 found 13021
kernel: BTRFS info (device sdc): failed to delete reference to eggdrop-1.6.19.ebuild, inode 2096893 parent 5881667
kernel: BTRFS error (device sdc) in __btrfs_unlink_inode:3662: errno=-5 IO failure
kernel: BTRFS info (device sdc): forced readonly

Greater detail listed below.

What next best to try?

Safer to try again but this time with "no_space_cache,no_inode_cache"?

Thanks,
Martin
)

On 29/09/13 22:29, Martin wrote:
> On 29/09/13 06:11, Duncan wrote:

>>> What does btrfs do (or can do) for recovery?
>>
>> Here's a general-case answer (courtesy gmane) to the order in which to
>> try recovery question, that Hugo posted a few weeks ago:
>>
>> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999
>
> Thanks for that. Very well found!
>
> The instructions from Hugo are:
>
> ####
> Let's assume that you don't have a physical device failure (which
> is a different set of tools -- mount -odegraded, btrfs dev del
> missing).
>
> First thing to do is to take a btrfs-image -c9 -t4 of the
> filesystem, and keep a copy of the output to show josef. :)
>
> Then start with -orecovery and -oro,recovery for pretty much
> anything.

For anyone following this, first a health warning: If your data is in any way critical or important, then you should already have a backup copy elsewhere. If not, best make a binary image copy of your disk first!

OK...
So with the latest kernel (3.11.2) and btrfs tools (Btrfs v0.20-rc1-358-g194aa4a), the sequence went:

mount -v -t btrfs -o recovery LABEL=bu_A /mnt/bu_A

(From syslog:)
kernel: device label bu_A devid 1 transid 17222 /dev/sdc
kernel: btrfs: enabling auto recovery
kernel: btrfs: disk space caching is enabled
kernel: btrfs: bdev /dev/sdc errs: wr 0, rd 27, flush 0, corrupt 0, gen 0

Running through a diff check for part of the backups, syslog reported:

kernel: btrfs read error corrected: ino 1 off 915433144320 (dev /dev/sdc sector 1813661856)

Also, the HDD was showing quite a few write operations so... Is "noatime" set?... Ooops... Didn't include a "ro"...

So, killed the diff check and remounted:

mount -v -t btrfs -o remount,recovery,noatime /mnt/bu_A
mount: /dev/sdc mounted on /mnt/bu_A

kernel: btrfs: enabling inode map caching
kernel: btrfs: enabling auto recovery
kernel: btrfs: disk space caching is enabled

And running the diff check again... Now zero writes to the HDD :-)

Various syslog messages were given:

kernel: parent transid verify failed on 907185135616 wanted 15935 found 12264
kernel: btrfs read error corrected: ino 1 off 907185135616 (dev /dev/sdc sector 1781823824)
kernel: parent transid verify failed on 907185143808 wanted 15935 found 12264
kernel: btrfs read error corrected: ino 1 off 907185143808 (dev /dev/sdc sector 1781823840)
kernel: parent transid verify failed on 907185139712 wanted 15935 found 12264
kernel: btrfs read error corrected: ino 1 off 907185139712 (dev /dev/sdc sector 1781823832)
kernel: parent transid verify failed on 907185152000 wanted 15935 found 10903
kernel: btrfs read error corrected: ino 1 off 907185152000 (dev /dev/sdc sector 1781823856)
kernel: parent transid verify failed on 907183783936 wanted 15935 found 12263
kernel: btrfs read error corrected: ino 1 off 907183783936 (dev /dev/sdc sector 1781821184)
kernel: parent transid verify failed on 907183792128 wanted 15935 found 10903
kernel: btrfs read error corrected: ino 1 off 907183792128 (dev /dev/sdc sector 1781821200)
kernel: parent transid verify failed on 907183796224 wanted 15935 found 12263
kernel: btrfs read error corrected: ino 1 off 907183796224 (dev /dev/sdc sector 1781821208)
kernel: parent transid verify failed on 907183841280 wanted 15935 found 10903
kernel: btrfs read error corrected: ino 1 off 907183841280 (dev /dev/sdc sector 1781821296)
kernel: parent transid verify failed on 907183878144 wanted 15935 found 12263
kernel: btrfs read error corrected: ino 1 off 907183878144 (dev /dev/sdc sector 1781821368)
kernel: parent transid verify failed on 907183874048 wanted 15935 found 12263
kernel: btrfs read error corrected: ino 1 off 907183874048 (dev /dev/sdc sector 1781821360)
kernel: verify_parent_transid: 25 callbacks suppressed
kernel: parent transid verify failed on 915431288832 wanted 16974 found 16972
kernel: repair_io_failure: 25 callbacks suppressed
kernel: btrfs read error corrected: ino 1 off 915431288832 (dev /dev/sdc sector 1813658232)
kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021
[...]

One directory tree failed the diff checks so I 'mv'-ed that one tree to rename it out of the way and then ran an "rm -Rf" to remove it.
That appeared to run fine until: kernel: parent transid verify failed on 915431862272 wanted 16974 found 16972 kernel: btrfs read error corrected: ino 1 off 915431862272 (dev /dev/sdc sector 1813659352) kernel: parent transid verify failed on 907185127424 wanted 15935 found 12264 kernel: btrfs read error corrected: ino 1 off 907185127424 (dev /dev/sdc sector 1781823808) kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: BTRFS info (device sdc): failed to delete reference to metadata.xml, inode 1846452 parent 5851502 kernel: ------------[ cut here ]------------ kernel: WARNING: CPU: 0 PID: 3236 at fs/btrfs/super.c:253 __btrfs_abort_transaction+0x4a/0xfc() kernel: btrfs: Transaction aborted (error -5) kernel: Modules linked in: nfsd auth_rpcgss oid_registry exportfs nfs_acl lockd sunrpc bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi ppdev evdev serio_raw pcspkr acpi_cpufreq snd_hda_intel snd_hda_codec mperf snd_pcm freq_table snd_page_alloc snd_timer parport_pc processor wmi bnx2 snd parport thermal_sys i2c_piix4 button usbhid firewire_ohci firewire_core xhci_hcd ata_generic pata_acpi kernel: CPU: 0 PID: 3236 Comm: nfsd Not tainted 3.11.2-gentoo_muse11_07 #1 kernel: Hardware name: System manufacturer System Product Name/E45M1-M PRO, BIOS 0502 09/21/2011 kernel: 0000000000000000 ffffffff81700892 ffffffff815261d1 ffff8801f91f1c18 kernel: ffffffff8102ea45 ffff88010b18e5a0 ffffffff811df675 ffff8801f91f1c38 kernel: 00000000fffffffb ffff880233afb000 ffff880230a3b960 0000000000000e4e kernel: Call Trace: kernel: [<ffffffff815261d1>] ? dump_stack+0x41/0x51 kernel: [<ffffffff8102ea45>] ? warn_slowpath_common+0x79/0x92 kernel: [<ffffffff811df675>] ? 
__btrfs_abort_transaction+0x4a/0xfc kernel: [<ffffffff8102eaf6>] ? warn_slowpath_fmt+0x45/0x4a kernel: [<ffffffff811df675>] ? __btrfs_abort_transaction+0x4a/0xfc kernel: [<ffffffff812071e3>] ? __btrfs_unlink_inode+0x19a/0x2c0 kernel: [<ffffffff812093bf>] ? btrfs_unlink_inode+0x12/0x35 kernel: [<ffffffff8120943e>] ? btrfs_unlink+0x5c/0x94 kernel: [<ffffffff810f8e03>] ? vfs_unlink+0x69/0xc8 kernel: [<ffffffffa029f215>] ? nfsd_unlink+0x18e/0x1d1 [nfsd] kernel: [<ffffffffa02a4e87>] ? nfsd3_proc_remove+0x67/0xab [nfsd] kernel: [<ffffffffa029a9d2>] ? nfsd_dispatch+0x91/0x148 [nfsd] kernel: [<ffffffffa0234fc7>] ? svc_process+0x3e1/0x630 [sunrpc] kernel: [<ffffffffa0235211>] ? svc_process+0x62b/0x630 [sunrpc] kernel: [<ffffffffa029a574>] ? nfsd+0xc0/0x117 [nfsd] kernel: [<ffffffffa029a4b4>] ? nfsd_destroy+0x64/0x64 [nfsd] kernel: [<ffffffff81047287>] ? kthread+0xad/0xb5 kernel: [<ffffffff810471da>] ? kthread_freezable_should_stop+0x41/0x41 kernel: [<ffffffff8152c5ec>] ? ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810471da>] ? kthread_freezable_should_stop+0x41/0x41 kernel: ---[ end trace 53d6fb93a497e75d ]--- kernel: BTRFS warning (device sdc): __btrfs_unlink_inode:3662: Aborting unused transaction(IO failure). 
kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: parent transid verify failed on 915444523008 wanted 16974 found 13021 kernel: btrfs read error corrected: ino 1 off 915433652224 (dev /dev/sdc sector 1813662848) kernel: btrfs read error corrected: ino 1 off 915433029632 (dev /dev/sdc sector 1813661632) kernel: btrfs read error corrected: ino 1 off 915433041920 (dev /dev/sdc sector 1813661656) kernel: btrfs read error corrected: ino 1 off 915433955328 (dev /dev/sdc sector 1813663440) kernel: btrfs read error corrected: ino 1 off 915433127936 (dev /dev/sdc sector 1813661824) kernel: btrfs read error corrected: ino 1 off 915434070016 (dev /dev/sdc sector 1813663664) kernel: btrfs read error corrected: ino 1 off 915433132032 (dev /dev/sdc sector 1813661832) kernel: btrfs read error corrected: ino 1 off 915433136128 (dev /dev/sdc sector 1813661840) kernel: btrfs read error corrected: ino 1 off 915433545728 (dev /dev/sdc sector 1813662640) kernel: BTRFS info (device sdc): failed to delete reference to metadata.xml, inode 1846733 parent 5851559 kernel: BTRFS warning (device sdc): __btrfs_unlink_inode:3662: Aborting unused transaction(IO failure). 
kernel: verify_parent_transid: 96 callbacks suppressed kernel: parent transid verify failed on 915431579648 wanted 16974 found 16972 kernel: repair_io_failure: 13 callbacks suppressed kernel: btrfs read error corrected: ino 1 off 915431579648 (dev /dev/sdc sector 1813658800) kernel: parent transid verify failed on 915432382464 wanted 16974 found 16972 kernel: btrfs read error corrected: ino 1 off 915432382464 (dev /dev/sdc sector 1813660368) kernel: parent transid verify failed on 915444707328 wanted 16974 found 13021 kernel: btrfs read error corrected: ino 1 off 915444707328 (dev /dev/sdc sector 1813684440) kernel: parent transid verify failed on 915445092352 wanted 16974 found 13021 kernel: btrfs read error corrected: ino 1 off 915445092352 (dev /dev/sdc sector 1813685192) kernel: parent transid verify failed on 915445100544 wanted 16974 found 13021 kernel: btrfs read error corrected: ino 1 off 915445100544 (dev /dev/sdc sector 1813685208) kernel: parent transid verify failed on 915431026688 wanted 16974 found 16972 kernel: btrfs read error corrected: ino 1 off 915431026688 (dev /dev/sdc sector 1813657720) kernel: parent transid verify failed on 915432538112 wanted 16974 found 16972 kernel: btrfs read error corrected: ino 1 off 915432538112 (dev /dev/sdc sector 1813660672) kernel: parent transid verify failed on 915444740096 wanted 16974 found 13021 kernel: btrfs read error corrected: ino 1 off 915444740096 (dev /dev/sdc sector 1813684504) kernel: parent transid verify failed on 915444469760 wanted 16974 found 13021 kernel: parent transid verify failed on 915444469760 wanted 16974 found 13021 kernel: parent transid verify failed on 915444469760 wanted 16974 found 13021 kernel: parent transid verify failed on 915444469760 wanted 16974 found 13021 kernel: parent transid verify failed on 915444469760 wanted 16974 found 13021 kernel: parent transid verify failed on 915444469760 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 
found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: verify_parent_transid: 45 callbacks suppressed kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: parent transid verify failed on 915444518912 wanted 16974 found 13021 kernel: btrfs read error corrected: ino 1 off 915431141376 (dev /dev/sdc sector 1813657944) kernel: btrfs read error corrected: ino 1 off 915431165952 (dev /dev/sdc sector 1813657992) kernel: btrfs read error corrected: ino 1 off 915431272448 (dev /dev/sdc sector 1813658200) kernel: btrfs read error corrected: ino 1 off 915431161856 (dev /dev/sdc sector 1813657984) kernel: btrfs read error corrected: ino 1 off 915445268480 (dev /dev/sdc sector 1813685536) kernel: btrfs read error corrected: ino 1 off 915440472064 (dev /dev/sdc sector 1813676168) kernel: btrfs read error corrected: ino 1 off 915431170048 (dev /dev/sdc sector 1813658000) kernel: btrfs read error corrected: ino 1 off 915431174144 (dev /dev/sdc sector 1813658008) kernel: btrfs read error corrected: ino 1 off 915431378944 (dev /dev/sdc sector 1813658408) kernel: verify_parent_transid: 147 callbacks suppressed kernel: parent transid verify failed on 915432869888 wanted 16974 found 16972 kernel: parent transid verify 
failed on 915444473856 wanted 16974 found 13021 kernel: parent transid verify failed on 915444473856 wanted 16974 found 13021 kernel: parent transid verify failed on 915433119744 wanted 16974 found 16972 kernel: parent transid verify failed on 915433656320 wanted 16974 found 16972 kernel: parent transid verify failed on 915433123840 wanted 16974 found 16972 kernel: parent transid verify failed on 915433050112 wanted 16974 found 16972 kernel: parent transid verify failed on 915444473856 wanted 16974 found 13021 kernel: parent transid verify failed on 915444473856 wanted 16974 found 13021 kernel: parent transid verify failed on 915444822016 wanted 16974 found 13021 kernel: BTRFS info (device sdc): failed to delete reference to eggdrop-1.6.19.ebuild, inode 2096893 parent 5881667 kernel: BTRFS error (device sdc) in __btrfs_unlink_inode:3662: errno=-5 IO failure kernel: BTRFS info (device sdc): forced readonly Next best step to try? Remount "-o recovery,noatime" again? Thanks, Martin -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
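With this much syslog output it can be hard to see which blocks were repaired and which never were. In the messages above, the address in "parent transid verify failed on N" appears to match the "off N" in a later "read error corrected" line when a repair succeeds. Assuming that pairing holds, a small filter might pick out the blocks that failed without ever being reported as corrected. This is a sketch only: the function name uncorrected_blocks is made up, and because the kernel suppresses some callbacks the result is at best a lower bound on what was actually repaired.

```shell
#!/bin/sh
# Sketch: from kernel log text on stdin, list block addresses that had
# "parent transid verify failed on ADDR" but no matching
# "read error corrected: ... off ADDR", with the number of failures seen.
uncorrected_blocks() {
    awk '
        /parent transid verify failed on/ {
            for (i = 1; i < NF; i++) if ($i == "on") { failed[$(i + 1)]++; break }
        }
        /read error corrected/ {
            for (i = 1; i < NF; i++) if ($i == "off") { fixed[$(i + 1)] = 1; break }
        }
        END {
            for (b in failed) if (!(b in fixed)) print b, failed[b]
        }' | sort -n
}
```

Fed the saved syslog, for example `grep sdc -A0 /var/log/messages | uncorrected_blocks` or simply the relevant excerpt, blocks such as 915444523008 that only ever show failures would be listed, which might narrow down where both metadata copies are stale.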
Chris Murphy
2013-Oct-03 01:31 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
On Oct 2, 2013, at 6:49 PM, Martin <m_btrfs@ml1.co.uk> wrote:

> kernel: btrfs read error corrected: ino 1 off 907183792128 (dev /dev/sdc sector 1781821200)

Can anyone answer if this is what corrupt metadata detection and correction looks like? From the original email this is a single disk, with default mkfs.btrfs. So I guess I'm asking an almost obvious question, but I'm still going to ask it. There is only one copy of data, but two copies of metadata so it can self-correct for metadata corruption.

Next question. Why is -o recovery needed to get this correction behavior? The original post was completely devoid of messages indicating correction.

Chris Murphy
Martin
2013-Oct-03 16:56 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
On 03/10/13 01:49, Martin wrote:

> Summary:
>
> Mounting "-o recovery,noatime" worked well and allowed a diff check to
> complete for all but one directory tree. So very nearly all the data is
> fine.
>
> Deleting the failed directory tree caused a call stack dump and eventually:
>
> kernel: parent transid verify failed on 915444822016 wanted 16974 found 13021
> kernel: BTRFS info (device sdc): failed to delete reference to eggdrop-1.6.19.ebuild, inode 2096893 parent 5881667
> kernel: BTRFS error (device sdc) in __btrfs_unlink_inode:3662: errno=-5 IO failure
> kernel: BTRFS info (device sdc): forced readonly
>
> Greater detail listed below.
>
> What next best to try?
>
> Safer to try again but this time with "no_space_cache,no_inode_cache"?
>
> Thanks,
> Martin

> Next best step to try?
>
> Remount "-o recovery,noatime" again?

In the meantime, trying:

btrfsck /dev/sdc

gave the following output + abort:

parent transid verify failed on 915444523008 wanted 16974 found 13021
Ignoring transid failure
btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec->ino != key->objectid || rec->refs > 1)' failed.
id not match free space cache generation (1625)
free space inode generation (0) did not match free space cache generation (1607)
free space inode generation (0) did not match free space cache generation (1604)
free space inode generation (0) did not match free space cache generation (1606)
free space inode generation (0) did not match free space cache generation (1620)
free space inode generation (0) did not match free space cache generation (1626)
free space inode generation (0) did not match free space cache generation (1609)
free space inode generation (0) did not match free space cache generation (1653)
free space inode generation (0) did not match free space cache generation (1628)
free space inode generation (0) did not match free space cache generation (1628)
free space inode generation (0) did not match free space cache generation (1649)

(There was no syslog output.)

Full btrfsck listing attached.

Suggestions please?

Thanks,
Martin
Martin
2013-Oct-04 15:43 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
What best to try next?

mount "-o recovery,noatime"

btrfsck:
    --repair              try to repair the filesystem
    --init-csum-tree      create a new CRC tree
    --init-extent-tree    create a new extent tree

or is a "scrub" worthwhile?

The fail and switch to read-only occurred whilst trying to delete a known bad directory tree. No worries for losing the data in that. But how best to clean up the filesystem errors?

Thanks,
Martin
Martin
2013-Oct-05 11:32 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
No comment so blindly trying:

btrfsck --repair /dev/sdc

gave the following abort:

btrfsck: extent-tree.c:2736: alloc_reserved_tree_block: Assertion `!(ret)' failed.

Full output attached.

All on:

3.11.2-gentoo
Btrfs v0.20-rc1-358-g194aa4a

For a 2TB single HDD formatted with defaults.

What next?

Thanks,
Martin
Martin
2013-Oct-05 12:05 UTC
ASM1083 rev01 PCIe to PCI Bridge chip (Was: Corrupt btrfs filesystem recovery... (Due to *sata* errors))
On 28/09/13 20:26, Martin wrote:

> AMD E-450 APU with Radeon(tm) HD Graphics AuthenticAMD GNU/Linux

Just in case someone else stumbles across this thread due to a related problem for my particular motherboard...

There appears to be a fatal hardware bug for the interrupt line deassert for a PCIe to PCI Bridge chip:

ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 01)

See the thread on https://lkml.org/lkml/2012/1/30/216

For that chip, the interrupt line is not always deasserted for PCI interrupts. The hardware fault appears to be fixed in ASM1083 rev 03. Unfortunately, there is no useful OS workaround possible for rev 01. Hence, the PCI interrupts are unusable for ASM1083 rev01? :-(

In brief, this means that the PCI card slots on the motherboard cannot be used for any hardware that might generate an interrupt. That means pretty much all normal PCI cards. (The PCIe card slots are fine.)

For my own example, there do not appear to be any other devices using that bridge chip. The only concern is for the sound chip but I happen to never use sound on that system and so that is disabled.

The problem is listed in syslog/dmesg by lines such as:

kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
kernel: Disabling IRQ #16

Unfortunately, the HDDs and network interfaces also use that irq or "irq 17" (which can also be affected). Losing the irq will badly slow down your system and can cause data corruption for heavy use of the HDD.

Use:

lspci | grep -i ASM1083

to see if you have that chip and if so, what revision.

To see if you have any irqpoll messages, use:

grep -ia irqpoll /var/log/messages

To list what devices use what interrupts, use either of:

grep -ia ' irq ' /var/log/messages
cat /proc/interrupts

Note that there should no longer be any ASM1083 rev01 chips being supplied by now. (ASM1083 rev03 chips have been seen in products.)

Hope that helps for that bit of obscurity!
Martin
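The lspci check described above can be narrowed down to just the revision number. This is a rough sketch only: it assumes lspci prints the bridge in the form quoted in the message ("... ASM1083/1085 PCIe to PCI Bridge (rev 01)"), and the function name asm1083_rev is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: read lspci output on stdin and print the revision of
# any ASM1083 bridge found. Prints nothing if no such chip is listed.
asm1083_rev() {
    sed -n 's/.*ASM1083.*(rev \([0-9a-fA-F][0-9a-fA-F]*\)).*/\1/p'
}
```

Usage would be: lspci | asm1083_rev. Going by the message above, an output of 01 would indicate the affected revision, while 03 is the revision reported as fixed.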
Martin
2013-Oct-05 13:18 UTC
Re: Corrupt btrfs filesystem recovery... What best instructions?
So...

The hint there is "btrfsck: extent-tree.c:2736", so trying:

btrfsck --repair --init-extent-tree /dev/sdc

That ran for a while until:

kernel: btrfsck[16610]: segfault at cc ip 000000000041d2a7 sp 00007fffd2c2d710 error 4 in btrfsck[400000+4d000]

There are no other messages in the syslog.

The output is attached.

What next?

Thanks,
Martin
Any clues or educated comment please?

Can the corrupt directory tree safely be ignored and left in place? Or might that cause everything to fall over in a big heap as soon as I try to write data again?

Could these other tricks work around or fix the corrupt tree:

Run a scrub?

Make a snapshot and work from the snapshot?

Or try "mount -o recovery,noatime" again?

Or is it dead?

(The 1.5TB of backup data is replicated elsewhere but it would be good to rescue this version rather than completely redo from scratch. Especially so for the sake of just a few MBytes of one corrupt directory tree.)

Thanks,
Martin
Chris Murphy
2013-Oct-07 19:03 UTC
Re: btrfsck --repair --init-extent-tree: segfault error 4
On Oct 7, 2013, at 8:56 AM, Martin <m_btrfs@ml1.co.uk> wrote:

> Or try "mount -o recovery,noatime" again?

Because of this: free space inode generation (0) did not match free space cache generation (1607)

Try mount option clear_cache. You could then use iotop to make sure the btrfs-freespace process becomes inactive before unmounting the file system; I don't think you need to wait in order to use the file system, nor do you need to unmount then remount without the option. But if it works, it should only be needed once, not as a persistent mount option.

> Or is it dead?
>
> (The 1.5TB of backup data is replicated elsewhere but it would be good
> to rescue this version rather than completely redo from scratch.
> Especially so for the sake of just a few MBytes of one corrupt directory
> tree.)

Right. If you snapshot the subvolume containing the corrupt portion of the file system, the snapshot probably inherits that corruption. But if you write to only one of them and those writes make the problem worse, the damage should be isolated to the one you write to. I might avoid writing to it, honestly. To save time, get increasingly aggressive to get data out of this directory and once you succeed, blow away the file system and start from scratch.

You could also then try kernel 3.12 rc4, as there are some btrfs bug fixes I'm seeing in there also, but I don't know if any of them will help your case. If you try it, mount normally, then try to get your data. If that doesn't work, try the recovery option. Maybe you'll get different results.

Chris Murphy
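The "mount normally, then try the recovery option" sequence suggested above could be scripted roughly as follows. This is a sketch under assumptions, not a tested recovery procedure: mounting read-only (ro) is an addition here, prompted by the advice to avoid writing to the damaged file system; the function name try_mounts and the MOUNT override variable are invented for illustration; and the device and mount point are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: try mount option sets from least to most aggressive
# and print the first set that mounts successfully.
# MOUNT can be overridden (e.g. for a dry run); it defaults to the real mount.
MOUNT=${MOUNT:-mount}

try_mounts() {
    dev=$1
    mnt=$2
    # "ro" is an assumption of this sketch; "recovery" is the option
    # discussed in the thread.
    for opts in ro ro,recovery; do
        if "$MOUNT" -t btrfs -o "$opts" "$dev" "$mnt" 2>/dev/null; then
            printf '%s\n' "$opts"
            return 0
        fi
    done
    return 1
}
```

For example: try_mounts /dev/sdc /mnt/backup (the mount point is made up). Once a mount succeeds, the data can be copied off before anything is written back to the damaged file system.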
In summary:

Looks like minimal damage remains and yet I'm still suffering "Input/output error" from btrfs and btrfsck appears to have looped...

A diff check suggests the damage to be in one (heavily linked to) tree of a few MBytes.

Would a scrub clear out the damaged trees?

Worth debugging?

Thanks,
Martin

Further detail:

On 07/10/13 20:03, Chris Murphy wrote:
>
> On Oct 7, 2013, at 8:56 AM, Martin <m_btrfs@ml1.co.uk> wrote:
>
>> Or try "mount -o recovery,noatime" again?
>
> Because of this: free space inode generation (0) did not match free
> space cache generation (1607)
>
> Try mount option clear_cache. You could then use iotop to make sure
> the btrfs-freespace process becomes inactive before unmounting the
> file system; I don't think you need to wait in order to use the file
> system, nor do you need to unmount then remount without the option.
> But if it works, it should only be needed once, not as a persistent
> mount option.

Thanks for that. So, trying:

mount -v -t btrfs -o recovery,noatime,clear_cache /dev/sdc

gave:

kernel: device label bu_A devid 1 transid 17448 /dev/sdc
kernel: btrfs: enabling inode map caching
kernel: btrfs: enabling auto recovery
kernel: btrfs: force clearing of disk cache
kernel: btrfs: disk space caching is enabled
kernel: btrfs: bdev /dev/sdc errs: wr 0, rd 27, flush 0, corrupt 0, gen 0

btrfs-freespace appeared occasionally briefly in atop but there's no noticeable disk activity. All very rapidly done?
Running a diff check to see if all ok and what might be missing gave the syslog output:

kernel: verify_parent_transid: 165 callbacks suppressed
kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found 13021

The diff eventually failed with "Input/output error".

'mv' to move this failed directory tree out of the way worked.

Attempting to use 'ln -s' gave the attached syslog output and the filesystem was made "Read-only".

Remounting:

mount -v -o remount,recovery,noatime,clear_cache,rw /dev/sdc

and the mv looks fine.

Trying the 'ln -s' again gives:

ln: creating symbolic link `./portage': Read-only file system

unmounting gave the syslog message:

kernel: btrfs: commit super ret -30

Mounting again:

mount -v -t btrfs -o recovery,noatime,clear_cache /dev/sdc

showed that the symbolic link was put in place ok.

Rerunning the diff check eventually found another "Input/output error".

So unmounted and tried again:

btrfsck --repair --init-extent-tree /dev/sdc

Failed with:

btrfs unable to find ref byte nr 911367733248 parent 0 root 1 owner 2 offset 0
btrfs unable to find ref byte nr 911367737344 parent 0 root 1 owner 1 offset 1
btrfs unable to find ref byte nr 911367741440 parent 0 root 1 owner 0 offset 1
leaf free space ret -297791851, leaf data size 3995, used 297795846 nritems 2
checking extents
btrfsck: extent_io.c:606: free_extent_buffer: Assertion `!(eb->refs < 0)' failed.
enabling repair mode
Checking filesystem on /dev/sdc
UUID: 38a60270-f9c6-4ed4-8421-4bf1253ae0b3
Creating a new extent tree
Failed to find [911367733248, 168, 4096]
Failed to find [911367737344, 168, 4096]
Failed to find [911367741440, 168, 4096]

Rerunning again and this time btrfsck is sat there at 100% CPU for the last 24 hours. Full output so far is:

parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
Ignoring transid failure

Nothing in syslog and no disk activity.

Looped?...

>> Or is it dead?
>>
>> (The 1.5TB of backup data is replicated elsewhere but it would be
>> good to rescue this version rather than completely redo from
>> scratch. Especially so for the sake of just a few MBytes of one
>> corrupt directory tree.)
>
> Right. If you snapshot the subvolume containing the corrupt portion
> of the file system, the snapshot probably inherits that corruption.
> But if you write to only one of them, if those writes make the
> problem worse, should be isolated only to the one you write to. I
> might avoid writing to it, honestly. To save time, get increasingly
> aggressive to get data out of this directory and once you succeed,
> blow away the file system and start from scratch.
>
> You could also then try kernel 3.12 rc4, as there are some btrfs bug
> fixes I'm seeing in there also, but I don't know if any of them will
> help your case. If you try it, mount normally, then try to get your
> data. If that doesn't work, try the recovery option. Maybe you'll get
> different results.

As suspected, thanks.

Would a scrub clear out the damaged trees?

Anything useful to try?

Any debug value in looking at the fail cases?

Is there a btrfsck mode of making good everything that is certain and dumping any remaining fragments into "lost + found"?
(Or is that way down the developments yet?)

Aside: btrfs looks to be usable enough, especially so with the disk format now stable, to at least offer the well established features as 'stable'...? (This is the first fail I've had, and considering the sata failed, is no surprise... Too severe a test! But can the limited damage be recovered...?)

Thanks,
Martin
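The mount log in the message above includes a per-device error summary ("bdev /dev/sdc errs: wr 0, rd 27, flush 0, corrupt 0, gen 0"). A small filter can pull out only the non-zero counters from such lines. This is a minimal sketch that assumes the message format is exactly as quoted in this thread; the function name bdev_errors is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print
# "<device> <counter> <value>" for every non-zero btrfs error counter
# found in a "bdev ... errs:" line.
bdev_errors() {
    sed -n 's/.*bdev \([^ ]*\) errs: \(.*\)$/\1 \2/p' |
        tr -d ',' |
        awk '{ for (i = 2; i < NF; i += 2)
                   if ($(i + 1) > 0) print $1, $i, $(i + 1) }'
}
```

Run against the log quoted above (for example: grep -a kernel: /var/log/messages | bdev_errors), the line from this thread would be reduced to the device name plus the single non-zero read counter.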