thr3ads.net - Btrfs devel - Corrupt btrfs filesystem recovery... (Due to *sata* errors) [Sep 2013]

Martin

2013-Sep-28 19:26 UTC

Corrupt btrfs filesystem recovery... (Due to sata errors)

This may be of interest both for the cause of the failure and for how to recover...


I have a known good 2TB (4kByte physical sectors) HDD that supports
sata3 (6Gbit/s). Writing data via rsync at the 6Gbit/s sata rate caused
IO errors for just THREE sectors...

Yet btrfsck bombs out with LOTs of errors...

How best to recover from this?

(This is a 'backup' disk so not 'critical', but it would be nice to avoid
rewriting about 1.5TB of data over the network...)


Is there an obvious sequence/recipe to follow for recovery?

Thanks,
Martin



Further details:

Linux  3.10.7-gentoo-r1 #2 SMP Fri Sep 27 23:38:06 BST 2013 x86_64 AMD
E-450 APU with Radeon(tm) HD Graphics AuthenticAMD GNU/Linux

# btrfs version
Btrfs v0.20-rc1-358-g194aa4a

Single 2TB HDD using default mkfs.btrfs.
Entire disk (/dev/sdc) is btrfs (no partitions).


The IO errors were:

kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248

Lots of sata error noise omitted.
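
As a rough aid for reading the kernel messages above, the following sketch (Python, not part of the original post) collapses repeated `end_request` lines into distinct sector numbers and converts each one to a byte offset. It assumes the kernel reports sectors in 512-byte units, which is the usual block-layer convention; the function name and the shortened sample log are illustrative only.

```python
import re
from collections import Counter

# Assumption: the kernel counts sectors in 512-byte units.
SECTOR_SIZE = 512

IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def failed_sectors(log_text):
    """Return {(device, sector): number of error lines} for a kernel log."""
    counts = Counter()
    for match in IO_ERROR.finditer(log_text):
        counts[(match.group(1), int(match.group(2)))] += 1
    return counts

# Shortened copy of the messages quoted in the post.
sample_log = """\
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3213925248
"""

for (dev, sector), hits in sorted(failed_sectors(sample_log).items()):
    offset = sector * SECTOR_SIZE
    print(f"{dev} sector {sector}: {hits} error line(s), byte offset {offset}")
```

Run against the full log, this would make it easy to confirm that only three distinct sectors were reported.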


The sata problem was fixed by limiting libata to 3Gbit/s:

libata.force=3.0G

added onto the Grub kernel line.
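
To confirm after a reboot that the option actually reached the kernel, one could inspect the kernel command line. The sketch below (Python) parses a command-line string for `libata.force` entries, assuming the comma-separated `[ID:]VAL` form described in the kernel parameter documentation. The helper name is made up for illustration, and on a live system the input would typically be the contents of `/proc/cmdline`.

```python
def libata_force_settings(cmdline):
    """Return a list of (port_id_or_None, value) pairs from libata.force=..."""
    settings = []
    for token in cmdline.split():
        if not token.startswith("libata.force="):
            continue
        for entry in token.split("=", 1)[1].split(","):
            # An entry is either "VAL" (all ports) or "ID:VAL" (one port/device).
            port_id, sep, value = entry.rpartition(":")
            settings.append((port_id if sep else None, value))
    return settings

# Illustrative command line; the root= value is invented for the example.
example = "root=/dev/sda2 ro libata.force=3.0G"
print(libata_force_settings(example))  # [(None, '3.0G')]
```

An empty list would suggest the option was not passed, for example because the wrong Grub entry was edited.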

Running "badblocks" twice in succession (non-destructive data test!)
shows no surface errors and no further errors on the sata interface.

Running btrfsck twice gives the same result, giving a failure with:

Ignoring transid failure
btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec->ino
!= key->objectid || rec->refs > 1)' failed.


An abridged summary is:

checking extents
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
leaf parent key incorrect 907185135616
bad block 907185135616
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
leaf parent key incorrect 915444883456
bad block 915444883456
leaf parent key incorrect 915445014528
bad block 915445014528
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
leaf parent key incorrect 907183771648
bad block 907183771648
leaf parent key incorrect 907183779840
bad block 907183779840
leaf parent key incorrect 907183783936
bad block 907183783936
[...]
leaf parent key incorrect 907185913856
bad block 907185913856
leaf parent key incorrect 907185917952
bad block 907185917952
parent transid verify failed on 915431579648 wanted 16974 found 16972
parent transid verify failed on 915431579648 wanted 16974 found 16972
parent transid verify failed on 915432382464 wanted 16974 found 16972
parent transid verify failed on 915432382464 wanted 16974 found 16972
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445100544 wanted 16974 found 13021
parent transid verify failed on 915445100544 wanted 16974 found 13021
parent transid verify failed on 915432734720 wanted 16974 found 16972
parent transid verify failed on 915432734720 wanted 16974 found 16972
parent transid verify failed on 915433144320 wanted 16974 found 16972
parent transid verify failed on 915433144320 wanted 16974 found 16972
parent transid verify failed on 915431862272 wanted 16974 found 16972
parent transid verify failed on 915431862272 wanted 16974 found 16972
parent transid verify failed on 915444715520 wanted 16974 found 13021
parent transid verify failed on 915444715520 wanted 16974 found 13021
parent transid verify failed on 915445166080 wanted 16974 found 13021
parent transid verify failed on 915445166080 wanted 16974 found 13021
parent transid verify failed on 915444740096 wanted 16974 found 13021
parent transid verify failed on 915444740096 wanted 16974 found 13021
bad block 915431026688
leaf parent key incorrect 915431141376
bad block 915431141376
leaf parent key incorrect 915431161856
[...]
leaf parent key incorrect 915445100544
bad block 915445100544
leaf parent key incorrect 915445166080
bad block 915445166080
leaf parent key incorrect 915445268480
bad block 915445268480
parent transid verify failed on 915444973568 wanted 16974 found 13021
parent transid verify failed on 915444973568 wanted 16974 found 13021
parent transid verify failed on 915444977664 wanted 16974 found 13021
parent transid verify failed on 915444977664 wanted 16974 found 13021
parent transid verify failed on 915444981760 wanted 16974 found 13021
parent transid verify failed on 915444981760 wanted 16974 found 13021
parent transid verify failed on 915432701952 wanted 16974 found 16972
parent transid verify failed on 915432701952 wanted 16974 found 16972
parent transid verify failed on 915444678656 wanted 16974 found 13021
parent transid verify failed on 915444678656 wanted 16974 found 13021
parent transid verify failed on 915444682752 wanted 16974 found 13021
parent transid verify failed on 915444682752 wanted 16974 found 13021
ref mismatch on [712708972544 4096] extent item 0, found 1
Backref 712708972544 parent 5 root 5 not found in extent tree
backpointer mismatch on [712708972544 4096]
ref mismatch on [712708988928 4096] extent item 0, found 1
Backref 712708988928 parent 5 root 5 not found in extent tree
backpointer mismatch on [712708988928 4096]
ref mismatch on [712708993024 4096] extent item 0, found 1
Backref 712708993024 parent 5 root 5 not found in extent tree
backpointer mismatch on [712708993024 4096]
ref mismatch on [712708997120 4096] extent item 0, found 1
Backref 712708997120 parent 5 root 5 not found in extent tree
backpointer mismatch on [712708997120 4096]
ref mismatch on [712709001216 4096] extent item 0, found 1
Backref 712709001216 parent 5 root 5 not found in extent tree
backpointer mismatch on [712709001216 4096]
[...]
ref mismatch on [712709062656 4096] extent item 0, found 1
Backref 712709062656 parent 5 root 5 not found in extent tree
backpointer mismatch on [712709062656 4096]
ref mismatch on [712709066752 4096] extent item 0, found 1
Backref 712709066752 parent 5 root 5 not found in extent tree
backpointer mismatch on [712709066752 4096]
ref mismatch on [907178082304 4096] extent item 1, found 0
Backref 907178082304 root 5 not referenced back 0x1b96f2a0
Incorrect global backref count on 907178082304 found 1 wanted 0
backpointer mismatch on [907178082304 4096]
owner ref check failed [907178082304 4096]
ref mismatch on [907178090496 4096] extent item 1, found 0
Backref 907178090496 root 5 not referenced back 0x1b98aed0
Incorrect global backref count on 907178090496 found 1 wanted 0
backpointer mismatch on [907178090496 4096]
owner ref check failed [907178090496 4096]
ref mismatch on [907178156032 4096] extent item 1, found 0
Backref 907178156032 root 5 not referenced back 0x3ffe5ce0
Incorrect global backref count on 907178156032 found 1 wanted 0
backpointer mismatch on [907178156032 4096]
owner ref check failed [907178156032 4096]
ref mismatch on [907178160128 4096] extent item 1, found 0
Backref 907178160128 root 5 not referenced back 0x5fbf8b0
Incorrect global backref count on 907178160128 found 1 wanted 0
backpointer mismatch on [907178160128 4096]
owner ref check failed [907178160128 4096]
[...]
ref mismatch on [907180011520 4096] extent item 1, found 0
Backref 907180011520 root 5 not referenced back 0x5980c7e0
Incorrect global backref count on 907180011520 found 1 wanted 0
backpointer mismatch on [907180011520 4096]
owner ref check failed [907180011520 4096]
owner ref check failed [907183771648 4096]
owner ref check failed [907183779840 4096]
owner ref check failed [907183783936 4096]
owner ref check failed [907183792128 4096]
owner ref check failed [907183796224 4096]
owner ref check failed [907183841280 4096]
owner ref check failed [907183874048 4096]
owner ref check failed [907183878144 4096]
owner ref check failed [907183882240 4096]
owner ref check failed [907183886336 4096]
owner ref check failed [907183894528 4096]
owner ref check failed [907183898624 4096]
owner ref check failed [907183902720 4096]
owner ref check failed [907183906816 4096]
owner ref check failed [907183910912 4096]
owner ref check failed [907185057792 4096]
owner ref check failed [907185082368 4096]
owner ref check failed [907185135616 4096]
ref mismatch on [907185139712 4096] extent item 1, found 0
Backref 907185139712 root 5 not referenced back 0x470fa690
Incorrect global backref count on 907185139712 found 1 wanted 0
backpointer mismatch on [907185139712 4096]
owner ref check failed [907185139712 4096]
[...]
ref mismatch on [934316011520 4096] extent item 0, found 1
Backref 934316011520 parent 5 root 5 not found in extent tree
backpointer mismatch on [934316011520 4096]
ref mismatch on [934316019712 4096] extent item 0, found 1
Backref 934316019712 parent 5 root 5 not found in extent tree
backpointer mismatch on [934316019712 4096]
ref mismatch on [934316032000 4096] extent item 0, found 1
Backref 934316032000 parent 5 root 5 not found in extent tree
backpointer mismatch on [934316032000 4096]
ref mismatch on [1128365600768 8192] extent item 1, found 0
Incorrect local backref count on 1128365600768 root 5 owner 889187
offset 0 found 0 wanted 1 back 0x6bb76d90
Backref disk bytenr does not match extent record, bytenr=1128365600768,
ref bytenr=17613768628740554752
backpointer mismatch on [1128365600768 8192]
owner ref check failed [1128365600768 8192]
ref mismatch on [1128365608960 8192] extent item 1, found 0
Incorrect local backref count on 1128365608960 root 5 owner 889188
offset 0 found 0 wanted 1 back 0x6bb76ec0
Backref disk bytenr does not match extent record, bytenr=1128365608960,
ref bytenr=8848955218968205284
backpointer mismatch on [1128365608960 8192]
owner ref check failed [1128365608960 8192]
ref mismatch on [1128365617152 8192] extent item 1, found 0
Incorrect local backref count on 1128365617152 root 5 owner 889189
offset 0 found 0 wanted 1 back 0x6bb76ff0
Backref disk bytenr does not match extent record, bytenr=1128365617152,
ref bytenr=1928784803178016523
backpointer mismatch on [1128365617152 8192]
owner ref check failed [1128365617152 8192]
ref mismatch on [1128365625344 4096] extent item 1, found 0
Incorrect local backref count on 1128365625344 root 5 owner 889190
offset 0 found 0 wanted 1 back 0x6bb77120
Backref disk bytenr does not match extent record, bytenr=1128365625344,
ref bytenr=3735616339648328182
backpointer mismatch on [1128365625344 4096]
owner ref check failed [1128365625344 4096]
[...]
ref mismatch on [1454133166080 12288] extent item 1, found 0
Incorrect local backref count on 1454133166080 root 5 owner 2096965
offset 0 found 0 wanted 1 back 0x50c68ad0
Backref disk bytenr does not match extent record, bytenr=1454133166080,
ref bytenr=64
backpointer mismatch on [1454133166080 12288]
owner ref check failed [1454133166080 12288]
Errors found in extent allocation tree or chunk allocation
checking free space cache
Checking filesystem on /dev/sdc
UUID: 38a60270-f9c6-4ed4-8421-4bf1253ae0b3
free space inode generation (0) did not match free space cache
generation (505)
free space inode generation (0) did not match free space cache
generation (486)
free space inode generation (0) did not match free space cache
generation (486)
free space inode generation (0) did not match free space cache
generation (486)
free space inode generation (0) did not match free space cache
generation (484)
free space inode generation (0) did not match free space cache
generation (486)
free space inode generation (0) did not match free space cache
generation (486)
free space inode generation (0) did not match free space cache
generation (484)
free space inode generation (0) did not match free space cache
generation (486)
free space inode generation (0) did not match free space cache
generation (484)
free space inode generation (0) did not match free space cache
generation (516)
free space inode generation (0) did not match free space cache
generation (486)
free space inode generation (0) did not match free space cache
generation (487)
free space inode generation (0) did not match free space cache
generation (486)
free space inode generation (0) did not match free space cache
generation (501)
free space inode generation (0) did not match free space cache
generation (531)
free space inode generation (0) did not match free space cache
generation (498)
free space inode generation (0) did not match free space cache
generation (498)
free space inode generation (0) did not match free space cache
generation (484)
free space inode generation (0) did not match free space cache
generation (532)
free space inode generation (0) did not match free space cache
generation (502)
free space inode generation (0) did not match free space cache
generation (532)
free space inode generation (0) did not match free space cache
generation (502)
[...]
free space inode generation (0) did not match free space cache
generation (1612)
free space inode generation (0) did not match free space cache
generation (1612)
free space inode generation (0) did not match free space cache
generation (1613)
free space inode generation (0) did not match free space cache
generation (1599)
free space inode generation (0) did not match free space cache
generation (1606)
free space inode generation (0) dparent transid verify failed on
907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 915431579648 wanted 16974 found 16972
parent transid verify failed on 915431579648 wanted 16974 found 16972
parent transid verify failed on 915432382464 wanted 16974 found 16972
parent transid verify failed on 915432382464 wanted 16974 found 16972
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445100544 wanted 16974 found 13021
parent transid verify failed on 915445100544 wanted 16974 found 13021
parent transid verify failed on 915432734720 wanted 16974 found 16972
parent transid verify failed on 915432734720 wanted 16974 found 16972
parent transid verify failed on 915433144320 wanted 16974 found 16972
parent transid verify failed on 915433144320 wanted 16974 found 16972
parent transid verify failed on 915431862272 wanted 16974 found 16972
parent transid verify failed on 915431862272 wanted 16974 found 16972
parent transid verify failed on 915444715520 wanted 16974 found 13021
parent transid verify failed on 915444715520 wanted 16974 found 13021
parent transid verify failed on 915445166080 wanted 16974 found 13021
parent transid verify failed on 915445166080 wanted 16974 found 13021
parent transid verify failed on 915444740096 wanted 16974 found 13021
parent transid verify failed on 915444740096 wanted 16974 found 13021
parent transid verify failed on 915444973568 wanted 16974 found 13021
parent transid verify failed on 915444973568 wanted 16974 found 13021
parent transid verify failed on 915444977664 wanted 16974 found 13021
parent transid verify failed on 915444977664 wanted 16974 found 13021
parent transid verify failed on 915444981760 wanted 16974 found 13021
parent transid verify failed on 915444981760 wanted 16974 found 13021
parent transid verify failed on 915432701952 wanted 16974 found 16972
parent transid verify failed on 915432701952 wanted 16974 found 16972
parent transid verify failed on 915444678656 wanted 16974 found 13021
parent transid verify failed on 915444678656 wanted 16974 found 13021
parent transid verify failed on 915444682752 wanted 16974 found 13021
parent transid verify failed on 915444682752 wanted 16974 found 13021
checking fs roots
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
[...]
parent transid verify failed on 915444523008 wanted 16974 found 13021
parent transid verify failed on 915444523008 wanted 16974 found 13021
parent transid verify failed on 915444523008 wanted 16974 found 13021
parent transid verify failed on 915444523008 wanted 16974 found 13021
Ignoring transid failure
btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec->ino
!= key->objectid || rec->refs > 1)' failed.
id not match free space cache generation (1625)
free space inode generation (0) did not match free space cache
generation (1607)
free space inode generation (0) did not match free space cache
generation (1604)
free space inode generation (0) did not match free space cache
generation (1606)
free space inode generation (0) did not match free space cache
generation (1620)
free space inode generation (0) did not match free space cache
generation (1626)
free space inode generation (0) did not match free space cache
generation (1609)
free space inode generation (0) did not match free space cache
generation (1653)
free space inode generation (0) did not match free space cache
generation (1628)
free space inode generation (0) did not match free space cache
generation (1628)
free space inode generation (0) did not match free space cache
generation (1649)

End of output
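
Output this long is easier to judge once condensed. As a sketch only (Python, not something from the thread), the "parent transid verify failed" lines can be grouped by their (wanted, found) pair to see how many distinct tree blocks are stale and by how many generations. In the abridged output above, three such pairs appear. The function name is illustrative, and the sample is a shortened copy of the lines quoted above.

```python
import re
from collections import Counter

# \s+ tolerates lines that were wrapped by the mail client.
TRANSID = re.compile(
    r"parent transid verify failed on\s+(\d+)\s+wanted\s+(\d+)\s+found\s+(\d+)"
)

def summarise_transid_failures(output):
    """Map (wanted, found) -> (distinct block count, total line count)."""
    blocks = {}
    lines = Counter()
    for bytenr, wanted, found in TRANSID.findall(output):
        key = (int(wanted), int(found))
        blocks.setdefault(key, set()).add(int(bytenr))
        lines[key] += 1
    return {key: (len(blocks[key]), lines[key]) for key in blocks}

sample = """\
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915431579648 wanted 16974 found 16972
"""

summary = summarise_transid_failures(sample)
for (wanted, found), (n_blocks, n_lines) in sorted(summary.items()):
    print(f"wanted {wanted}, found {found} (gap {wanted - found}): "
          f"{n_blocks} block(s), {n_lines} line(s)")
```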







--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Murphy

2013-Sep-28 20:51 UTC

Re: Corrupt btrfs filesystem recovery... (Due to sata errors)

On Sep 28, 2013, at 1:26 PM, Martin <m_btrfs@ml1.co.uk> wrote:

> Writing data via rsync at the 6Gbit/s sata rate caused
> IO errors for just THREE sectors...
>
> Yet btrfsck bombs out with LOTs of errors…

Any fs will bomb out on write errors.

> How best to recover from this?

Why you're getting I/O errors at SATA 6Gbps link speed needs to be
understood. Is it a bad cable? Bad SATA port? Drive or controller firmware
bug? Or libata driver bug?

> Lots of sata error noise omitted.

An entire dmesg might still be useful. I don't know if the list will
handle the whole dmesg in one email, but it's worth a shot (reply to an
email in the thread, don't change the subject).

It's possible software or hardware problems are detected well before
writes are even initiated.

> Running "badblocks" twice in succession (non-destructive data test!)
> shows no surface errors and no further errors on the sata interface.

SATA link speed related errors aren't related to bad blocks. If you do
a smartctl -x on the drive, chances are it's recording PHY Event errors
that might be relevant, and also SMART might record UDMA/CRC errors that
would just corroborate that the drive also found link errors.

> Running btrfsck twice gives the same result, giving a failure with:

Well honestly at this point I expect file system corruption, as it's
entirely possible that before the hardware dropped the link speed down to
SATA 3Gbps, there was corrupt data already sent to the drive, and that's not
something Btrfs can know about until trying to read the data back in. So
*shrug* - I don't see Btrfs as a way to totally mitigate hardware problems.
It's the same problem with bad RAM, and Btrfs doesn't like that either.


Chris Murphy


Martin

2013-Sep-28 22:51 UTC

Re: Corrupt btrfs filesystem recovery... (Due to sata errors)

Chris,

All agreed. Further comment inlined:

(Should have mentioned more prominently that the hardware problem has
been worked-around by limiting the sata to 3Gbit/s on bootup.)


On 28/09/13 21:51, Chris Murphy wrote:
>
> On Sep 28, 2013, at 1:26 PM, Martin <m_btrfs@ml1.co.uk> wrote:
> 
>> Writing data via rsync at the 6Gbit/s sata rate caused IO errors
>> for just THREE sectors...
>> 
>> Yet btrfsck bombs out with LOTs of errors…
> 
> Any fs will bomb out on write errors.
Indeed. However, are not the sata errors reported back to btrfs so that
it knows whatever parts haven't been updated?

Is there not a mechanism to then go "read-only"?

Also, should not the journal limit the damage?

>> How best to recover from this?
> 
> Why you're getting I/O errors at SATA 6Gbps link speed needs to be
> understood. Is it a bad cable? Bad SATA port? Drive or controller
> firmware bug? Or libata driver bug?

I systematically eliminated causes such as leads, PSU, and NCQ. Limiting
libata to only use 3Gbit/s is the one change that gives a consistent fix. The
HDD and motherboard both support 6Gbit/s, but hey-ho, that's an
experiment I can try again some other time when I have another HDD/SSD
to test in there.

In any case, for the existing HDD - motherboard combination, using sata2
rather than sata3 speeds shouldn't noticeably impact performance.
(Other than that sata2 works reliably and so is infinitely better for this
case!)

>> Lots of sata error noise omitted.
>
> An entire dmesg might still be useful. I don't know if the list will
> handle the whole dmesg in one email, but it's worth a shot (reply to
> an email in the thread, don't change the subject).

I can email directly if of use/interest. Let me know offlist.

> do a smartctl -x on the drive, chances are it's recording PHY Event

(smartctl -x errors shown further down...)

Nothing untoward noticed:

# smartctl -a /dev/sdc

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-...
LU WWN Device Id: ...
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Sep 28 23:35:57 2013 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       9
  3 Spin_Up_Time            0x0027   253   159   021    Pre-fail  Always       -       1983
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       55
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       800
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       53
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       31
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3115
194 Temperature_Celsius     0x0022   118   110   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
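
For anyone wanting to check the attributes most relevant to this thread without reading the whole table, here is a minimal sketch (Python, not from the original post) that pulls the RAW_VALUE column for named attributes out of `smartctl -a` text. It assumes each attribute row is on a single line with the ten columns shown in the header; the function name is illustrative, and the sample rows are copied from the table above.

```python
def smart_raw_values(report, names):
    """Map attribute name -> RAW_VALUE string for rows of a smartctl -a table."""
    wanted = set(names)
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID#, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in wanted:
            values[fields[1]] = fields[9]
    return values

sample_report = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(smart_raw_values(
    sample_report,
    ["Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count"],
))
```

All three raw values are 0 here, which matches the "nothing untoward" reading of the table.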


# smartctl -x /dev/sdc

... also shows the errors it saw:

(Just the last 4 copied which look timed for when the HDD was last
exposed to 6Gbit/s sata)

Error 46 [21] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 00 08 00 00 6c 1a 4b b0 e0 00  Error: AMNF 8 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 00 08 00 00 6c 1a 4b b0 e0 08     10:51:07.192  READ DMA EXT
  35 00 00 00 08 00 00 6c 1a 4b a8 e0 08     10:51:07.192  WRITE DMA EXT
  25 00 00 00 08 00 00 6c 1a 4b a8 e0 08     10:51:07.192  READ DMA EXT
  35 00 00 00 08 00 00 6c 1a 4b a8 e0 08     10:51:07.192  WRITE DMA EXT
  25 00 00 00 08 00 00 6c 1a 4b a8 e0 08     10:51:07.157  READ DMA EXT

Error 45 [20] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 6c 1a 48 00 e0 08     10:51:03.450  READ DMA EXT
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     10:51:03.449  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     10:51:03.449  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     10:51:03.446  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     10:51:03.446  SET FEATURES [Set transfer mode]

Error 44 [19] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 6c 1a 48 00 e0 08     10:51:00.453  READ DMA EXT
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     10:51:00.452  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     10:51:00.452  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     10:51:00.449  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     10:51:00.449  SET FEATURES [Set transfer mode]

Error 43 [18] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 6c 1a 48 00 e0 08     10:50:57.455  READ DMA EXT
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     10:50:57.455  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     10:50:57.455  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     10:50:57.452  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     10:50:57.452  SET FEATURES [Set transfer mode]

Error 42 [17] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 6c 1a 48 00 e0 08     10:50:54.459  READ DMA EXT
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     10:50:54.458  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     10:50:54.458  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     10:50:54.455  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     10:50:54.455  SET FEATURES [Set transfer mode]
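
The register rows in these error entries can be decoded by hand, and the sketch below (Python, not part of the original message) shows the arithmetic. It assumes the byte layout implied by the column header in the log: ER, a placeholder, ST, a two-byte COUNT, then six LBA bytes with the most significant first, followed by DV and DC. Treating bit 0 of the error register as AMNF follows the label smartctl prints here; the function name is illustrative.

```python
def decode_error_registers(row):
    """Decode an 'ER -- ST COUNT LBA_48 LH LM LL DV DC' row from smartctl -x."""
    fields = row.split()
    error = int(fields[0], 16)
    status = int(fields[2], 16)
    count = int("".join(fields[3:5]), 16)   # two COUNT bytes
    lba = int("".join(fields[5:11]), 16)    # six LBA bytes, high byte first
    return {
        "error": error,
        "status": status,
        "count": count,
        "lba": lba,
        "amnf": bool(error & 0x01),  # assumption: bit 0 is the AMNF flag
    }

# Register row copied from Error 46 above.
decoded = decode_error_registers("01 -- 51 00 08 00 00 6c 1a 4b b0 e0 00")
print(decoded["lba"], decoded["count"], decoded["amnf"])  # 1813662640 8 True
```

The decoded LBA and sector count agree with the "8 sectors at LBA = 0x6c1a4bb0 = 1813662640" text that smartctl prints for that entry.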






>> Running btrfsck twice gives the same result, giving a failure
>> with:
> 
> Well honestly at this point I expect file system corruption as it's
> entirely possible that before the hardware dropped the link speed
> down to SATA 3Gbps, there was corrupt data already sent to the drive
> and that's not something Btrfs can know about until trying to read
> the data back in. So *shrug* - I don't see Btrfs as a way to totally
> mitigate hardware problems. It's the same problem with bad RAM, and
> Btrfs doesn't like that either.

Indeed. Hence trapping 'unexpectedness' where reasonable to then go
read-only... (I guess a hard compromise though whilst still debugging
bugs 'unexpectedness'! But still good to have in mind. ;-) )


Regards,
Martin





Martin

2013-Sep-28 22:54 UTC

Re: Corrupt btrfs filesystem recovery... (Due to sata errors)

On 28/09/13 20:26, Martin wrote:
> ... btrfsck bombs out with LOTs of errors...
> 
> How best to recover from this?
> 
> (This is a 'backup' disk so not 'critical' but it would be nice to avoid
> rewriting about 1.5TB of data over the network...)
> 
> 
> Is there an obvious sequence/recipe to follow for recovery?

I've got the drive reliably working with the sata limited to 3Gbit/s.
What is the best sequence to try to tidy-up and carry on with the 1.5TB
or so of data on there, rather than working from scratch?

So far, I've only run btrfsck since the corruption errors for the three
sectors...

Suggestions for recovery?

Thanks,
Martin


Chris Murphy

2013-Sep-29 02:06 UTC

Re: Corrupt btrfs filesystem recovery... (Due to sata errors)

On Sep 28, 2013, at 4:51 PM, Martin <m_btrfs@ml1.co.uk> wrote:
> Indeed. However, are not the sata errors reported back to btrfs so that
> it knows whatever parts haven't been updated?

It's a good question.

My doubtful speculation of such a mechanism is that it is really not the
responsibility of the file system to be prepared for the hardware face planting
this spectacularly. The hardware really should do better than this. There are
specifications that apply here, and the drive and controller and driver all
agreed long before the mounting of a volume and writes started to occur. But
then later on, at some point in the middle of the really important part of the
conversation (writing your data) something in the hardware chain puked and said
"OHHh wait about that prior conversation, I'm really confused, let's talk at a
slower speed shall we?" So the before part is just a lost conversation, is my
speculation.

The other thing is that SATA and SAS handle these things differently. When
there's such a serious error that results in a link speed change, usually the
bus is reset and for SATA it means the command queue is lost. And I don't think
Btrfs is informed of what commands were completed vs failed in such a case.

But I'd love someone who actually knows what they're talking about to answer
that question.

My expectation though, is that unlike perhaps other file systems, Btrfs's
design goal is to handle the data that did get written, better. In that it's
still accessible where other file systems possibly will have more difficulty.

> Is there not a mechanism to then go "read-only"?

I don't know. In this case it does seem sorta reasonable. But the dmesg
might still be revealing. The PHY Event counters indicate a lot of retries of
over 1000 sectors.

> Also, should not the journal limit the damage?

Well it's COW so it's not quite like a journaled file system, but yeah it
should be in a position to know at the next mount time the most recent state of
file system consistency. But that doesn't mean it can fix the parts that are
just fundamentally broken. But I think it's a valid question, "now what?"
because I don't actually know the state of your file system or how to determine
it. So maybe Hugo, or someone else has some thoughts.

But for sure I would move to kernel 3.11.2 or 3.12-rc2 before mounting this file
system again.
> 
> 
>>> How best to recover from this?
>> 
>> Why you're getting I/O errors at SATA 6Gbps link speed needs to be
>> understood. Is it a bad cable? Bad SATA port? Drive or controller
>> firmware bug? Or libata driver bug?
> 
> I systematically eliminated such as leads, PSU, and NCQ. Limiting libata
> to only use 3Gbit/s is the one change that gives a consistent fix. The
> HDD and motherboard both support 6Gbit/s, but hey-ho, that's an
> experiment I can try again some other time when I have another HDD/SSD
> to test in there.

Stick with forced 3Gbps, but I think it's worth while to find out what the
actual problem is. One day you forget about this 3Gbps SATA link, upgrade or
regress to another kernel and you don't have the 3Gbps forced speed on the
parameter line, and poof - you've got more problems again. The hardware
shouldn't negotiate a 6Gbps link and then do a backwards swan dive at 30,000'
with your data as if it's an afterthought.
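
One way to guard against silently losing the forced link speed after a kernel or bootloader change is to check the kernel command line at boot. The sketch below assumes the limit was applied with the `libata.force=` kernel parameter, whose values take the form `[ID:]VAL` in a comma-separated list. The thread does not show the exact setting used, so the sample command lines and the function names are hypothetical.

```python
def forced_sata_values(cmdline: str) -> list:
    """Collect every value given to libata.force= on a kernel command line."""
    values = []
    for token in cmdline.split():
        if token.startswith("libata.force="):
            values.extend(token.split("=", 1)[1].split(","))
    return values


def has_3gbps_limit(cmdline: str) -> bool:
    """True if any libata.force entry limits a link to 3.0Gbps."""
    # Each entry is "[ID:]VAL"; the optional ID names a port or device.
    return any(v.split(":")[-1] == "3.0Gbps" for v in forced_sata_values(cmdline))


# Hypothetical command lines, for illustration only.
print(has_3gbps_limit("root=/dev/sda2 ro libata.force=3.0Gbps"))
print(has_3gbps_limit("root=/dev/sda2 ro libata.force=2:3.0Gbps,noncq"))
print(has_3gbps_limit("root=/dev/sda2 ro"))
```

On a running system the string to test would come from reading `/proc/cmdline`; a boot script could log a warning whenever the check returns False.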

> In any case, for the existing HDD - motherboard combination, using sata2
> rather than sata3 speeds shouldn't noticeably impact performance. (Other
> than sata2 works reliably and so is infinitely better for this case!)

It's true.

> 
> 
>>> Lots of sata error noise omitted.
>> 
>> An entire dmesg might still be useful. I don't know if the list will
>> handle the whole dmesg in one email, but it's worth a shot (reply to
>> an email in the thread, don't change the subject).
> 
> I can email directly if of use/interest. Let me know offlist.

Use pastebin.com and post the link if it's really huge, but I'd consider
setting it to no expiration because if something interesting is learned, people
doing searches have a better chance of finding the problem if the link hasn't
expired.

I would also separately unmount the file system, note the latest kernel message,
then mount the file system and see if there are any kernel messages that might
indicate recognition of problems with the fs.

I would not use btrfsck --repair until someone says it's a good idea.
That person would not be me.

Chris Murphy


Martin

2013-Sep-29 02:10 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

On 28/09/13 23:54, Martin wrote:
> On 28/09/13 20:26, Martin wrote:
> 
>> ... btrfsck bombs out with LOTs of errors...
>>
>> How best to recover from this?
>>
>> (This is a 'backup' disk so not 'critical' but it would be nice to avoid
>> rewriting about 1.5TB of data over the network...)
>>
>>
>> Is there an obvious sequence/recipe to follow for recovery?
> 
> 
> I've got the drive reliably working with the sata limited to 3Gbit/s.
> What is the best sequence to try to tidy-up and carry on with the 1.5TB
> or so of data on there, rather than working from scratch?
> 
> 
> So far, I've only run btrfsck since the corruption...

So...

Any options for btrfsck to fix things?

Or is anything/everything that is fixable automatically fixed on the
next mount?

Or should:

btrfs scrub /dev/sdX

be run first?

Or?


What does btrfs do (or can do) for recovery?

Advice welcomed,

Thanks,
Martin





Martin

2013-Sep-29 02:31 UTC

Re: Corrupt btrfs filesystem recovery... (Due to sata errors)

Chris,

Thanks for good comment/discussion.

On 29/09/13 03:06, Chris Murphy wrote:
> 
> On Sep 28, 2013, at 4:51 PM, Martin <m_btrfs@ml1.co.uk> wrote:
> 
> Stick with forced 3Gbps, but I think it's worth while to find out
> what the actual problem is. One day you forget about this 3Gbps SATA
> link, upgrade or regress to another kernel and you don't have the
> 3Gbps forced speed on the parameter line, and poof - you've got more
> problems again. The hardware shouldn't negotiate a 6Gbps link and
> then do a backwards swan dive at 30,000' with your data as if it's an
> afterthought.

I've got an engineer's curiosity so that one is very definitely marked
for revisiting at some time... If only to blog that x-y-z combination is
a tar pit for your data...

>> In any case, for the existing HDD - motherboard combination, using
>> sata2 rather than sata3 speeds shouldn't noticeably impact
>> performance. (Other than sata2 works reliably and so is infinitely
>> better for this case!)
> 
> It's true.

Well, the IO data rate for badblocks is exactly the same as before,
limited by the speed of the physical rust spinning and data density...

> I would also separately unmount the file system, note the latest
> kernel message, then mount the file system and see if there are any
> kernel messages that might indicate recognition of problems with the
> fs.
> 
> I would not use btrfsck --repair until someone says it's a good idea.
> That person would not be me.

It is sat unmounted until some informed opinion is gained...


Thanks again for your notes,

Regards,
Martin





Duncan

2013-Sep-29 05:11 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

Martin posted on Sun, 29 Sep 2013 03:10:37 +0100 as excerpted:
> So...
> 
> Any options for btrfsck to fix things?
> 
> Or is anything/everything that is fixable automatically fixed on the
> next mount?
> 
> Or should:
> 
> btrfs scrub /dev/sdX
> 
> be run first?
> 
> Or?
> 
> 
> What does btrfs do (or can do) for recovery?

Here's a general-case answer (courtesy gmane) to the order in which to
try recovery question, that Hugo posted a few weeks ago:

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999

Note that in specific cases someone who knew what they were doing could
omit some steps and focus on others, but I'm not at that level of "know
what I'm doing", so...

Scrub... would go before this, if it's useful.  But scrub depends on a
second, valid copy being available in order to fix the bad-checksum
one.  On a single device btrfs, btrfs defaults to DUP metadata (unless
it's SSD), so you may have a second copy for that, but you won't have a
second copy of the data.  This is a very strong reason to go btrfs raid1
mode (for both data and metadata) if you can, because that gives you a
second copy of everything, thereby actually making use of btrfs' checksum
and scrub ability.  (Unfortunately, there is as yet no way to do N-way
mirroring, there's only the second copy not a third, no matter how many
devices you have in that "raid1".)
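
The dependence of scrub on a second valid copy can be illustrated with a toy model. This is only a sketch of the idea, not btrfs code: `zlib.crc32` stands in for the checksum btrfs actually uses, and the function name and block contents are invented for the example.

```python
import zlib


def scrub_block(copies, expected_crc):
    """Check every stored copy of one block against its recorded checksum.

    Returns (status, data) where status is "ok", "repaired" or "unrecoverable".
    """
    if not copies:
        raise ValueError("at least one stored copy is required")
    good = [c for c in copies if zlib.crc32(c) == expected_crc]
    if len(good) == len(copies):
        return "ok", copies[0]
    if good:
        # A valid copy exists, so the bad one could be rewritten from it.
        return "repaired", good[0]
    # No copy matches the checksum: the damage is detected but cannot be fixed.
    return "unrecoverable", None


payload = b"tree block contents"
crc = zlib.crc32(payload)

# DUP-style metadata: two copies, one damaged.
print(scrub_block([b"damaged by a bad write", payload], crc)[0])

# Single-copy data: the only copy is damaged.
print(scrub_block([b"damaged by a bad write"], crc)[0])
```

With two copies the damaged block is recoverable; with one copy the checksum can only report the problem, which is the situation for file data on a default single-device filesystem.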

Finally, if you mentioned your kernel (and btrfs-tools) version(s) I
missed it, but [boilerplate recommendation, stressed repeatedly both in
the wiki and on-list] btrfs being still labeled experimental and under
serious development, there's still lots of bugs fixed every kernel
release.  So as Chris Murphy said, if you're not on 3.11-stable or
3.12-rcX already, get there.  Not only can the safety of your data depend on
it, but by choosing to run experimental we're all testers, and our
reports if something does go wrong will be far more usable if we're on a
current kernel.  Similarly, btrfs-tools 0.20-rc1 is already somewhat old;
you really should be on a git-snapshot beyond that.  (The master branch
is kept stable, work is done in other branches and only merged to master
when it's considered suitably stable, so a recently updated btrfs-tools
master HEAD is at least in theory always the best possible version you
can be running.  If that's ever NOT the case, then testers need to be
reporting that ASAP so it can be fixed, too.)

Back to the kernel, it's worth noting that 3.12-rcX includes an option
that turns off most btrfs bugons by default.  Unless you're a btrfs
developer (which it doesn't sound like you are), you'll want to activate
that (turning off the bugons), as they're not helpful for ordinary users
and just force unnecessary reboots when something minor and otherwise
immediately recoverable goes wrong.  That's just one of the latest fixes.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Martin

2013-Sep-29 21:29 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

On 29/09/13 06:11, Duncan wrote:
> Martin posted on Sun, 29 Sep 2013 03:10:37 +0100 as excerpted:
> 
>> So...
>>
>> Any options for btrfsck to fix things?
>>
>> Or is anything/everything that is fixable automatically fixed on the
>> next mount?
>>
>> Or should:
>>
>> btrfs scrub /dev/sdX
>>
>> be run first?
>>
>> Or?
>>
>>
>> What does btrfs do (or can do) for recovery?
> 
> Here's a general-case answer (courtesy gmane) to the order in which to
> try recovery question, that Hugo posted a few weeks ago:
> 
> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999

Thanks for that. Very well found!

The instructions from Hugo are:

####
   Let's assume that you don't have a physical device failure (which
is a different set of tools -- mount -odegraded, btrfs dev del
missing).

   First thing to do is to take a btrfs-image -c9 -t4 of the
filesystem, and keep a copy of the output to show josef. :)

   Then start with -orecovery and -oro,recovery for pretty much
anything.

   If those fail, then look in dmesg for errors relating to the log
tree -- if that's corrupt and can't be read (or causes a crash), use
btrfs-zero-log.

   If there's problems with the chunk tree -- the only one I've seen
recently was reporting something like "can't map address" -- then
chunk-recover may be of use.

   After that, btrfsck is probably the next thing to try. If options
-s1, -s2, -s3 have any success, then btrfs-select-super will help by
replacing the superblock with one that works. If that's not going to
be useful, fall back to btrfsck --repair.

   Finally, btrfsck --repair --init-extent-tree may be necessary if
there's a damaged extent tree. Finally, if you've got corruption in
the checksums, there's --init-csum-tree.

   Hugo.
####

Those will be tried next...
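
For reference, Hugo's sequence can be written down as an ordered checklist. The sketch below only records and prints the steps; it runs nothing. The tool names and options are copied from the quoted text (device and mount-point arguments are deliberately left out), while the tuple layout and the function name are just one illustrative way to hold them.

```python
# (when to try it, tool or option as named in the quoted instructions)
RECOVERY_STEPS = [
    ("always, before anything else", "btrfs-image -c9 -t4"),
    ("first attempt", "mount -o recovery"),
    ("first attempt, read-only", "mount -o ro,recovery"),
    ("dmesg shows log tree errors", "btrfs-zero-log"),
    ("chunk tree problems, e.g. can't map address", "chunk-recover"),
    ("after that", "btrfsck with -s1, -s2, -s3"),
    ("one of the -s options succeeded", "btrfs-select-super"),
    ("superblock copies are no help", "btrfsck --repair"),
    ("damaged extent tree", "btrfsck --repair --init-extent-tree"),
    ("corruption in the checksums", "--init-csum-tree"),
]


def format_checklist(steps):
    """Return one numbered line per step, least invasive first."""
    return [f"{i:2d}. {tool}  [{when}]" for i, (when, tool) in enumerate(steps, 1)]


for line in format_checklist(RECOVERY_STEPS):
    print(line)
```

Keeping the steps in order matters because the later ones rewrite more of the filesystem, which is why the image and the recovery mounts come first.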


> Note that in specific cases someone who knew what they were doing could
> omit some steps and focus on others, but I'm not at that level of "know
> what I'm doing", so...
> 
> Scrub... would go before this, if it's useful.  But scrub depends on a
> second, valid copy being available in order to fix the bad-checksum
> one.  On a single device btrfs, btrfs defaults to DUP metadata (unless
> it's SSD), so you may have a second copy for that, but you won't have a
> second copy of the data.  This is a very strong reason to go btrfs raid1
> mode (for both data and metadata) if you can, because that gives you a
> second copy of everything, thereby actually making use of btrfs' checksum
> and scrub ability.  (Unfortunately, there is as yet no way to do N-way
> mirroring, there's only the second copy not a third, no matter how many
> devices you have in that "raid1".)
> 
> Finally, if you mentioned your kernel (and btrfs-tools) version(s) I
> missed it, but [boilerplate recommendation, stressed repeatedly both in
> the wiki and on-list] btrfs being still labeled experimental and under
> serious development, there's still lots of bugs fixed every kernel
> release.  So as Chris Murphy said, if you're not on 3.11-stable or
> 3.12-rcX already, get there.  Not only can the safety of your data depend on
> it, but by choosing to run experimental we're all testers, and our
> reports if something does go wrong will be far more usable if we're on a
> current kernel.  Similarly, btrfs-tools 0.20-rc1 is already somewhat old;
> you really should be on a git-snapshot beyond that.  (The master branch
> is kept stable, work is done in other branches and only merged to master
> when it's considered suitably stable, so a recently updated btrfs-tools
> master HEAD is at least in theory always the best possible version you
> can be running.  If that's ever NOT the case, then testers need to be
> reporting that ASAP so it can be fixed, too.)
> 
> Back to the kernel, it's worth noting that 3.12-rcX includes an option
> that turns off most btrfs bugons by default.  Unless you're a btrfs
> developer (which it doesn't sound like you are), you'll want to activate
> that (turning off the bugons), as they're not helpful for ordinary users
> and just force unnecessary reboots when something minor and otherwise
> immediately recoverable goes wrong.  That's just one of the latest fixes.

Looking up what's available for Gentoo, the maintainers there look to be
nicely sharp with multiple versions available all the way up to kernel
3.11.2...

There's also the latest available from btrfs tools with
sys-fs/btrfs-progs "9999"...

OK, so onto the cutting edge to compile them in...


Thanks all,
Martin




Martin

2013-Sep-29 21:55 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

On 29/09/13 22:29, Martin wrote:
> Looking up what's available for Gentoo, the maintainers there look to be
> nicely sharp with multiple versions available all the way up to kernel
> 3.11.2...

That is being pulled in now as expected:

sys-kernel/gentoo-sources-3.11.2

> There's also the latest available from btrfs tools with
> sys-fs/btrfs-progs "9999"...

Oddly, that caused emerge to report:

[ebuild     UD ] sys-fs/btrfs-progs-0.19.11 [0.20_rc1_p358] 0 kB

which is a *downgrade*. Hence, I'm keeping with the 0.20_rc1_p358.

> OK, so onto the cutting edge to compile them in...

Interesting times as is said in a certain part of the world...
Martin




Duncan

2013-Sep-30 07:51 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

Martin posted on Sun, 29 Sep 2013 22:55:43 +0100 as excerpted:
> On 29/09/13 22:29, Martin wrote:
> 
>> Looking up what's available for Gentoo, the maintainers there look to
>> be nicely sharp with multiple versions available all the way up to
>> kernel 3.11.2...

Cool, another gentooer! =:^)

> That is being pulled in now as expected:
> 
> sys-kernel/gentoo-sources-3.11.2

FWIW, I've been doing my own kernels (mainline) since back on mandrake a
decade ago, and I just changed up my scripts a bit when I switched to
gentoo.  Then later on I changed them up a bit more to be able to run git
kernels.  These days I normally first try (and switch to if no serious
bugs) the dev kernel around -rc2 or so, by which point I figure
anything likely to eat my system for breakfast should be either worked
thru, or at least news of it available.  As a non-dev, it's very cool
being able to spot and report bugs, possibly bisecting to a specific
commit, and watch them get fixed before general kernel release. Just one
way I as a non-dev can contribute back.  =:^)

To take care of packages that depend on a kernel package, I used to have
a kernel (gentoo-sources-2.6.9999 or some such, back then, now of course
it'd be 3.9999) in package.provided, but these days I don't even need
that. =:^)

>> There's also the latest available from btrfs tools with
>> sys-fs/btrfs-progs "9999"...
> 
> Oddly, that caused emerge to report:
> 
> [ebuild     UD ] sys-fs/btrfs-progs-0.19.11 [0.20_rc1_p358] 0 kB
> 
> which is a *downgrade*. Hence, I'm keeping with the 0.20_rc1_p358.

btrfs-progs-9999 is available, but as a live package, it's masked in
keeping with gentoo policy.  So to get it you'd need to unmask it.

But 0.20_rc1_p358 shouldn't be /too/ far back.  In fact, I'm guessing the
p-number is the (serialized) patch sequence number indicating the number 
of commits forward from the rc1 tag.  And on the (locally unmasked) -9999 
version here, a git describe --tags gives me

v0.20-rc1-358-g194aa4a

... so 0.20_rc1_p358 is very likely identical to the live version at this 
point, and it makes no difference, except that the non-live version is a 
stable snapshot instead of a version that might change every time you 
merge it, if upstream has done any further commits.

So btrfs-progs-0.20_rc1_p358 should be fine.  And you were updating 
kernel to 3.11.2, so that should be fine by the time you read this, as 
well. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Martin

2013-Oct-03 00:49 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

So... The fix:

(

Summary:

Mounting "-o recovery,noatime" worked well and allowed a diff check to
complete for all but one directory tree. So very nearly all the data is
fine.

Deleting the failed directory tree caused a call stack dump and eventually:

kernel: parent transid verify failed on 915444822016 wanted 16974 found
13021
kernel: BTRFS info (device sdc): failed to delete reference to
eggdrop-1.6.19.ebuild, inode 2096893 parent 5881667
kernel: BTRFS error (device sdc) in __btrfs_unlink_inode:3662: errno=-5
IO failure
kernel: BTRFS info (device sdc): forced readonly

Greater detail listed below.

What next best to try?

Safer to try again but this time with
"no_space_cache,no_inode_cache"?

Thanks,
Martin

)

On 29/09/13 22:29, Martin wrote:
> On 29/09/13 06:11, Duncan wrote:
>>> What does btrfs do (or can do) for recovery?
>>
>> Here's a general-case answer (courtesy gmane) to the order in which to
>> try recovery question, that Hugo posted a few weeks ago:
>>
>> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999
> 
> Thanks for that. Very well found!
> 
> The instructions from Hugo are:
> 
> ####
>    Let's assume that you don't have a physical device failure (which
> is a different set of tools -- mount -odegraded, btrfs dev del
> missing).
> 
>    First thing to do is to take a btrfs-image -c9 -t4 of the
> filesystem, and keep a copy of the output to show josef. :)
> 
>    Then start with -orecovery and -oro,recovery for pretty much
> anything.

For anyone following this, first a health warning:

If your data is in any way critical or important, then you should
already have a backup copy elsewhere. If not, best make a binary image
copy of your disk first!

OK... So with the latest kernel (3.11.2) and btrfs tools
(Btrfs v0.20-rc1-358-g194aa4a), the sequence went:

mount -v -t btrfs -o recovery LABEL=bu_A /mnt/bu_A

(From syslog:)

kernel: device label bu_A devid 1 transid 17222 /dev/sdc
kernel: btrfs: enabling auto recovery
kernel: btrfs: disk space caching is enabled
kernel: btrfs: bdev /dev/sdc errs: wr 0, rd 27, flush 0, corrupt 0, gen 0

Running through a diff check for part of the backups, syslog reported:

kernel: btrfs read error corrected: ino 1 off 915433144320 (dev /dev/sdc
sector 1813661856)

Also, the HDD was showing quite a few write operations so... Is
"noatime" set?... Ooops... Didn't include a "ro"... So, killed the diff
check and remounted:

mount -v -t btrfs -o remount,recovery,noatime /mnt/bu_A
mount: /dev/sdc mounted on /mnt/bu_A

kernel: btrfs: enabling inode map caching
kernel: btrfs: enabling auto recovery
kernel: btrfs: disk space caching is enabled

And running the diff check again... Now zero writes to the HDD :-)

Various syslog messages were given:

kernel: parent transid verify failed on 907185135616 wanted 15935 found
12264
kernel: btrfs read error corrected: ino 1 off 907185135616 (dev /dev/sdc
sector 1781823824)
kernel: parent transid verify failed on 907185143808 wanted 15935 found
12264
kernel: btrfs read error corrected: ino 1 off 907185143808 (dev /dev/sdc
sector 1781823840)
kernel: parent transid verify failed on 907185139712 wanted 15935 found
12264
kernel: btrfs read error corrected: ino 1 off 907185139712 (dev /dev/sdc
sector 1781823832)
kernel: parent transid verify failed on 907185152000 wanted 15935 found
10903
kernel: btrfs read error corrected: ino 1 off 907185152000 (dev /dev/sdc
sector 1781823856)
kernel: parent transid verify failed on 907183783936 wanted 15935 found
12263
kernel: btrfs read error corrected: ino 1 off 907183783936 (dev /dev/sdc
sector 1781821184)
kernel: parent transid verify failed on 907183792128 wanted 15935 found
10903
kernel: btrfs read error corrected: ino 1 off 907183792128 (dev /dev/sdc
sector 1781821200)
kernel: parent transid verify failed on 907183796224 wanted 15935 found
12263
kernel: btrfs read error corrected: ino 1 off 907183796224 (dev /dev/sdc
sector 1781821208)
kernel: parent transid verify failed on 907183841280 wanted 15935 found
10903
kernel: btrfs read error corrected: ino 1 off 907183841280 (dev /dev/sdc
sector 1781821296)
kernel: parent transid verify failed on 907183878144 wanted 15935 found
12263
kernel: btrfs read error corrected: ino 1 off 907183878144 (dev /dev/sdc
sector 1781821368)
kernel: parent transid verify failed on 907183874048 wanted 15935 found
12263
kernel: btrfs read error corrected: ino 1 off 907183874048 (dev /dev/sdc
sector 1781821360)
kernel: verify_parent_transid: 25 callbacks suppressed
kernel: parent transid verify failed on 915431288832 wanted 16974 found
16972
kernel: repair_io_failure: 25 callbacks suppressed
kernel: btrfs read error corrected: ino 1 off 915431288832 (dev /dev/sdc
sector 1813658232)
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
[...]
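
Long runs of these messages are easier to judge once they are tallied. The following is a rough sketch of one way to do that. It assumes that a "read error corrected" line at the same offset as a "parent transid verify failed" line means that block was repaired from its second copy, which is an interpretation rather than something the log states. The function name is made up, and the sample text is a few lines copied from the excerpt above, mail-client line wraps included.

```python
import re
from collections import Counter

# \s+ lets the patterns match across the line wraps seen in the mail.
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+)\s+wanted (\d+)\s+found\s+(\d+)")
CORRECTED_RE = re.compile(
    r"btrfs read error corrected: ino \d+ off (\d+) \(dev \S+\s+sector (\d+)\)")


def summarise(log_text):
    """Count (wanted, found) pairs and list failed blocks never reported corrected."""
    failures = Counter()
    failed_blocks = set()
    for block, wanted, found in TRANSID_RE.findall(log_text):
        failures[(int(wanted), int(found))] += 1
        failed_blocks.add(int(block))
    corrected = {int(off) for off, _sector in CORRECTED_RE.findall(log_text)}
    return failures, sorted(failed_blocks - corrected)


sample = """\
kernel: parent transid verify failed on 907185135616 wanted 15935 found
12264
kernel: btrfs read error corrected: ino 1 off 907185135616 (dev /dev/sdc
sector 1781823824)
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
"""

failures, uncorrected = summarise(sample)
print(dict(failures))
print(uncorrected)
```

Blocks that keep failing without a matching correction, such as 915444523008 in the sample, are the ones worth reporting to the list.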

One directory tree failed the diff checks so I 'mv'-ed that one tree to
rename it out of the way and then ran an "rm -Rf" to remove it.

That appeared to run fine until:

kernel: parent transid verify failed on 915431862272 wanted 16974 found
16972
kernel: btrfs read error corrected: ino 1 off 915431862272 (dev /dev/sdc
sector 1813659352)
kernel: parent transid verify failed on 907185127424 wanted 15935 found
12264
kernel: btrfs read error corrected: ino 1 off 907185127424 (dev /dev/sdc
sector 1781823808)
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: BTRFS info (device sdc): failed to delete reference to
metadata.xml, inode 1846452 parent 5851502
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 0 PID: 3236 at fs/btrfs/super.c:253
__btrfs_abort_transaction+0x4a/0xfc()
kernel: btrfs: Transaction aborted (error -5)
kernel: Modules linked in: nfsd auth_rpcgss oid_registry exportfs
nfs_acl lockd sunrpc bridge stp llc snd_hda_codec_realtek
snd_hda_codec_hdmi ppdev evdev serio_raw pcspkr acpi_cpufreq
snd_hda_intel snd_hda_codec mperf snd_pcm freq_table snd_page_alloc
snd_timer parport_pc processor wmi bnx2 snd parport thermal_sys
i2c_piix4 button usbhid firewire_ohci firewire_core xhci_hcd ata_generic
pata_acpi
kernel: CPU: 0 PID: 3236 Comm: nfsd Not tainted 3.11.2-gentoo_muse11_07 #1
kernel: Hardware name: System manufacturer System Product Name/E45M1-M
PRO, BIOS 0502 09/21/2011
kernel: 0000000000000000 ffffffff81700892 ffffffff815261d1 ffff8801f91f1c18
kernel: ffffffff8102ea45 ffff88010b18e5a0 ffffffff811df675 ffff8801f91f1c38
kernel: 00000000fffffffb ffff880233afb000 ffff880230a3b960 0000000000000e4e
kernel: Call Trace:
kernel: [<ffffffff815261d1>] ? dump_stack+0x41/0x51
kernel: [<ffffffff8102ea45>] ? warn_slowpath_common+0x79/0x92
kernel: [<ffffffff811df675>] ? __btrfs_abort_transaction+0x4a/0xfc
kernel: [<ffffffff8102eaf6>] ? warn_slowpath_fmt+0x45/0x4a
kernel: [<ffffffff811df675>] ? __btrfs_abort_transaction+0x4a/0xfc
kernel: [<ffffffff812071e3>] ? __btrfs_unlink_inode+0x19a/0x2c0
kernel: [<ffffffff812093bf>] ? btrfs_unlink_inode+0x12/0x35
kernel: [<ffffffff8120943e>] ? btrfs_unlink+0x5c/0x94
kernel: [<ffffffff810f8e03>] ? vfs_unlink+0x69/0xc8
kernel: [<ffffffffa029f215>] ? nfsd_unlink+0x18e/0x1d1 [nfsd]
kernel: [<ffffffffa02a4e87>] ? nfsd3_proc_remove+0x67/0xab [nfsd]
kernel: [<ffffffffa029a9d2>] ? nfsd_dispatch+0x91/0x148 [nfsd]
kernel: [<ffffffffa0234fc7>] ? svc_process+0x3e1/0x630 [sunrpc]
kernel: [<ffffffffa0235211>] ? svc_process+0x62b/0x630 [sunrpc]
kernel: [<ffffffffa029a574>] ? nfsd+0xc0/0x117 [nfsd]
kernel: [<ffffffffa029a4b4>] ? nfsd_destroy+0x64/0x64 [nfsd]
kernel: [<ffffffff81047287>] ? kthread+0xad/0xb5
kernel: [<ffffffff810471da>] ? kthread_freezable_should_stop+0x41/0x41
kernel: [<ffffffff8152c5ec>] ? ret_from_fork+0x7c/0xb0
kernel: [<ffffffff810471da>] ? kthread_freezable_should_stop+0x41/0x41
kernel: ---[ end trace 53d6fb93a497e75d ]---
kernel: BTRFS warning (device sdc): __btrfs_unlink_inode:3662: Aborting
unused transaction(IO failure).
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: parent transid verify failed on 915444523008 wanted 16974 found
13021
kernel: btrfs read error corrected: ino 1 off 915433652224 (dev /dev/sdc
sector 1813662848)
kernel: btrfs read error corrected: ino 1 off 915433029632 (dev /dev/sdc
sector 1813661632)
kernel: btrfs read error corrected: ino 1 off 915433041920 (dev /dev/sdc
sector 1813661656)
kernel: btrfs read error corrected: ino 1 off 915433955328 (dev /dev/sdc
sector 1813663440)
kernel: btrfs read error corrected: ino 1 off 915433127936 (dev /dev/sdc
sector 1813661824)
kernel: btrfs read error corrected: ino 1 off 915434070016 (dev /dev/sdc
sector 1813663664)
kernel: btrfs read error corrected: ino 1 off 915433132032 (dev /dev/sdc
sector 1813661832)
kernel: btrfs read error corrected: ino 1 off 915433136128 (dev /dev/sdc
sector 1813661840)
kernel: btrfs read error corrected: ino 1 off 915433545728 (dev /dev/sdc
sector 1813662640)
kernel: BTRFS info (device sdc): failed to delete reference to
metadata.xml, inode 1846733 parent 5851559
kernel: BTRFS warning (device sdc): __btrfs_unlink_inode:3662: Aborting
unused transaction(IO failure).
kernel: verify_parent_transid: 96 callbacks suppressed
kernel: parent transid verify failed on 915431579648 wanted 16974 found
16972
kernel: repair_io_failure: 13 callbacks suppressed
kernel: btrfs read error corrected: ino 1 off 915431579648 (dev /dev/sdc
sector 1813658800)
kernel: parent transid verify failed on 915432382464 wanted 16974 found
16972
kernel: btrfs read error corrected: ino 1 off 915432382464 (dev /dev/sdc
sector 1813660368)
kernel: parent transid verify failed on 915444707328 wanted 16974 found
13021
kernel: btrfs read error corrected: ino 1 off 915444707328 (dev /dev/sdc
sector 1813684440)
kernel: parent transid verify failed on 915445092352 wanted 16974 found
13021
kernel: btrfs read error corrected: ino 1 off 915445092352 (dev /dev/sdc
sector 1813685192)
kernel: parent transid verify failed on 915445100544 wanted 16974 found
13021
kernel: btrfs read error corrected: ino 1 off 915445100544 (dev /dev/sdc
sector 1813685208)
kernel: parent transid verify failed on 915431026688 wanted 16974 found
16972
kernel: btrfs read error corrected: ino 1 off 915431026688 (dev /dev/sdc
sector 1813657720)
kernel: parent transid verify failed on 915432538112 wanted 16974 found
16972
kernel: btrfs read error corrected: ino 1 off 915432538112 (dev /dev/sdc
sector 1813660672)
kernel: parent transid verify failed on 915444740096 wanted 16974 found
13021
kernel: btrfs read error corrected: ino 1 off 915444740096 (dev /dev/sdc
sector 1813684504)
kernel: parent transid verify failed on 915444469760 wanted 16974 found
13021
kernel: parent transid verify failed on 915444469760 wanted 16974 found
13021
kernel: parent transid verify failed on 915444469760 wanted 16974 found
13021
kernel: parent transid verify failed on 915444469760 wanted 16974 found
13021
kernel: parent transid verify failed on 915444469760 wanted 16974 found
13021
kernel: parent transid verify failed on 915444469760 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: verify_parent_transid: 45 callbacks suppressed
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: parent transid verify failed on 915444518912 wanted 16974 found
13021
kernel: btrfs read error corrected: ino 1 off 915431141376 (dev /dev/sdc
sector 1813657944)
kernel: btrfs read error corrected: ino 1 off 915431165952 (dev /dev/sdc
sector 1813657992)
kernel: btrfs read error corrected: ino 1 off 915431272448 (dev /dev/sdc
sector 1813658200)
kernel: btrfs read error corrected: ino 1 off 915431161856 (dev /dev/sdc
sector 1813657984)
kernel: btrfs read error corrected: ino 1 off 915445268480 (dev /dev/sdc
sector 1813685536)
kernel: btrfs read error corrected: ino 1 off 915440472064 (dev /dev/sdc
sector 1813676168)
kernel: btrfs read error corrected: ino 1 off 915431170048 (dev /dev/sdc
sector 1813658000)
kernel: btrfs read error corrected: ino 1 off 915431174144 (dev /dev/sdc
sector 1813658008)
kernel: btrfs read error corrected: ino 1 off 915431378944 (dev /dev/sdc
sector 1813658408)
kernel: verify_parent_transid: 147 callbacks suppressed
kernel: parent transid verify failed on 915432869888 wanted 16974 found
16972
kernel: parent transid verify failed on 915444473856 wanted 16974 found
13021
kernel: parent transid verify failed on 915444473856 wanted 16974 found
13021
kernel: parent transid verify failed on 915433119744 wanted 16974 found
16972
kernel: parent transid verify failed on 915433656320 wanted 16974 found
16972
kernel: parent transid verify failed on 915433123840 wanted 16974 found
16972
kernel: parent transid verify failed on 915433050112 wanted 16974 found
16972
kernel: parent transid verify failed on 915444473856 wanted 16974 found
13021
kernel: parent transid verify failed on 915444473856 wanted 16974 found
13021
kernel: parent transid verify failed on 915444822016 wanted 16974 found
13021
kernel: BTRFS info (device sdc): failed to delete reference to
eggdrop-1.6.19.ebuild, inode 2096893 parent 5881667
kernel: BTRFS error (device sdc) in __btrfs_unlink_inode:3662: errno=-5
IO failure
kernel: BTRFS info (device sdc): forced readonly
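
The "parent transid verify failed" lines above repeat for a small number of tree blocks. A minimal sketch for condensing them, assuming each message sits on a single line as syslog writes it (the lines are wrapped in this mail) and using a made-up function name, transid_summary:

```shell
# Summarise "parent transid verify failed" messages read on stdin.
# Output columns: block wanted found generations-behind occurrences
transid_summary() {
    awk '/parent transid verify failed on/ {
        block = wanted = found = ""
        for (i = 1; i < NF; i++) {
            if ($i == "on")     block  = $(i + 1)
            if ($i == "wanted") wanted = $(i + 1)
            if ($i == "found")  found  = $(i + 1)
        }
        key = block " " wanted " " found " " (wanted - found)
        count[key]++
    }
    END { for (k in count) print k, count[k] }' | sort -n
}

# Usage, assuming the kernel log is in /var/log/messages:
#   transid_summary < /var/log/messages
```

A block found at generation 13021 when 16974 was wanted is 3953 transactions stale, whereas a block found at 16972 is only 2 behind.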

Next best step to try?

Remount "-o recovery,noatime" again?

Thanks,
Martin


Chris Murphy

2013-Oct-03 01:31 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

On Oct 2, 2013, at 6:49 PM, Martin <m_btrfs@ml1.co.uk> wrote:
> kernel: btrfs read error corrected: ino 1 off 907183792128 (dev /dev/sdc
> sector 1781821200)

Can anyone answer if this is what corrupt metadata detection and correction
looks like? From the original email this is a single disk, with default
mkfs.btrfs. So I guess I'm asking an almost obvious question, but I'm still
going to ask it. There is only one copy of data, but two copies of metadata
so it can self-correct for metadata corruption.
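
The profiles actually in use can be checked with "btrfs filesystem df". A minimal sketch, assuming its output contains a line of the form "Metadata, DUP: total=..., used=..." (the exact layout varies between btrfs-progs versions); the function name and the mount point /mnt/backup are made up for illustration:

```shell
# Print the metadata profile from "btrfs filesystem df" output read on stdin.
metadata_profile() {
    awk -F'[ ,:]+' '$1 == "Metadata" { print $2; exit }'
}

# Usage (hypothetical mount point):
#   btrfs filesystem df /mnt/backup | metadata_profile
```

If this prints DUP, a second copy of each metadata block exists on the same disk for the read-repair path to fall back on.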

Next question. Why is -o recovery needed to get this correction behavior? The
original post was completely devoid of messages indicating correction.


Chris Murphy

Martin

2013-Oct-03 16:56 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

On 03/10/13 01:49, Martin wrote:
> Summary:
> 
> Mounting "-o recovery,noatime" worked well and allowed a diff check to
> complete for all but one directory tree. So very nearly all the data is
> fine.
> 
> Deleting the failed directory tree caused a call stack dump and eventually:
> 
> kernel: parent transid verify failed on 915444822016 wanted 16974 found
> 13021
> kernel: BTRFS info (device sdc): failed to delete reference to
> eggdrop-1.6.19.ebuild, inode 2096893 parent 5881667
> kernel: BTRFS error (device sdc) in __btrfs_unlink_inode:3662: errno=-5
> IO failure
> kernel: BTRFS info (device sdc): forced readonly
> 
> 
> Greater detail listed below.
> 
> What next best to try?
> 
> Safer to try again but this time with
> "no_space_cache,no_inode_cache"?
> 
> Thanks,
> Martin
> Next best step to try?
> 
> Remount "-o recovery,noatime" again?

In the meantime, trying:

btrfsck /dev/sdc

gave the following output + abort:

parent transid verify failed on 915444523008 wanted 16974 found 13021
Ignoring transid failure
btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec->ino
!= key->objectid || rec->refs > 1)' failed.
id not match free space cache generation (1625)
free space inode generation (0) did not match free space cache
generation (1607)
free space inode generation (0) did not match free space cache
generation (1604)
free space inode generation (0) did not match free space cache
generation (1606)
free space inode generation (0) did not match free space cache
generation (1620)
free space inode generation (0) did not match free space cache
generation (1626)
free space inode generation (0) did not match free space cache
generation (1609)
free space inode generation (0) did not match free space cache
generation (1653)
free space inode generation (0) did not match free space cache
generation (1628)
free space inode generation (0) did not match free space cache
generation (1628)
free space inode generation (0) did not match free space cache
generation (1649)

(There was no syslog output.)

Full btrfsck listing attached.

Suggestions please?

Thanks,
Martin

Martin

2013-Oct-04 15:43 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

What best to try next?

mount "-o recovery,noatime"

btrfsck:
    --repair                    try to repair the filesystem
    --init-csum-tree            create a new CRC tree
    --init-extent-tree          create a new extent tree

or is a "scrub" worthwhile?


The failure and switch to read-only occurred whilst trying to delete a
known bad directory tree. No worries about losing the data in that.

But how best to clean up the filesystem errors?


Thanks,
Martin





Martin

2013-Oct-05 11:32 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

No comment so blindly trying:

btrfsck --repair /dev/sdc

gave the following abort:

btrfsck: extent-tree.c:2736: alloc_reserved_tree_block: Assertion
`!(ret)' failed.

Full output attached.


All on:

3.11.2-gentoo
Btrfs v0.20-rc1-358-g194aa4a

For a 2TB single HDD formatted with defaults.


What next?

Thanks,
Martin




Martin

2013-Oct-05 12:05 UTC

ASM1083 rev01 PCIe to PCI Bridge chip (Was: Corrupt btrfs filesystem recovery... (Due to sata errors))

On 28/09/13 20:26, Martin wrote:
> AMD
> E-450 APU with Radeon(tm) HD Graphics AuthenticAMD GNU/Linux

Just in case someone else stumbles across this thread due to a related
problem for my particular motherboard...


There appears to be a fatal hardware bug in the interrupt line deassertion
of a PCIe to PCI Bridge chip:

ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 01)

See the thread on https://lkml.org/lkml/2012/1/30/216

For that chip, the interrupt line is not always deasserted for PCI
interrupts. The hardware fault appears to be fixed in ASM1083 rev 03.
Unfortunately, there is no useful OS workaround possible for rev 01.

Hence, the PCI interrupts are unusable for ASM1083 rev01 ? :-(


In brief, this means that the PCI card slots on the motherboard cannot
be used for any hardware that might generate an interrupt. That means
pretty much all normal PCI cards. (The PCIe card slots are fine.)

In my own case, no other devices appear to be using that bridge chip.
The only concern is the sound chip, but I never use sound on that system
and so that is disabled.


The problem is listed in syslog/dmesg by lines such as:

kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
kernel: Disabling IRQ #16


Unfortunately, the HDDs and network interfaces also use that irq or "irq
17" (which can also be affected). Losing the irq will badly slow down
your system and can cause data corruption under heavy use of the HDD.



Use:
lspci | grep -i ASM1083

to see if you have that chip and if so, what revision.

To see if you have any irqpoll messages, use:
grep -ia irqpoll /var/log/messages

To list what devices use what interrupts, use either of:
grep -ia ' irq ' /var/log/messages
cat /proc/interrupts
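
The log check above can be narrowed to just the interrupt numbers the kernel gave up on. A minimal sketch that reads kernel log text on stdin; the function name disabled_irqs is made up for illustration, and the log path is the one used above:

```shell
# List the IRQ numbers from "Disabling IRQ #N" kernel messages read on stdin.
disabled_irqs() {
    sed -n 's/.*Disabling IRQ #\([0-9][0-9]*\).*/\1/p' | sort -un
}

# Usage, assuming syslog is written to /var/log/messages:
#   disabled_irqs < /var/log/messages
```

Each number printed can then be looked up in the first column of /proc/interrupts to see which devices share that line.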



Note that there should no longer be any ASM1083 rev01 chips being
supplied by now. (ASM1083 rev03 chips have been seen in products.)

Hope that helps for that bit of obscurity!
Martin


Martin

2013-Oct-05 13:18 UTC

Re: Corrupt btrfs filesystem recovery... What best instructions?

So...

The hint there is "btrfsck: extent-tree.c:2736", so trying:

btrfsck --repair --init-extent-tree /dev/sdc

That ran for a while until:

kernel: btrfsck[16610]: segfault at cc ip 000000000041d2a7 sp
00007fffd2c2d710 error 4 in btrfsck[400000+4d000]

There are no other messages in the syslog.

The output attached.
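
The segfault line carries enough to locate the faulting instruction. "at cc" is the faulting address (a near-zero address, which often indicates a member access through a NULL pointer), "error 4" is the x86 page fault code for a user-mode read of an unmapped page, and "btrfsck[400000+4d000]" gives the mapping base and size. A minimal sketch that extracts the instruction offset within the binary, assuming the message sits on one line as syslog writes it; the function name is made up for illustration:

```shell
# Print the offset of the faulting instruction inside the mapped binary,
# computed as ip minus mapping base, from a kernel segfault line on stdin.
segfault_offset() {
    sed -n 's/.* ip \([0-9a-f]*\) .* in [^[]*\[\([0-9a-f]*\)+[0-9a-f]*].*/\1 \2/p' |
    while read -r ip base; do
        printf '0x%x\n' $(( 0x$ip - 0x$base ))
    done
}

# Usage, assuming syslog is written to /var/log/messages:
#   grep segfault /var/log/messages | segfault_offset
```

Given a btrfsck binary built with debug symbols, "addr2line -e ./btrfsck 0x41d2a7" should then name the source line. Because the mapping base here is 0x400000 (a non-relocated executable), the absolute ip can be passed as-is.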


What next?


Thanks,
Martin




Martin

2013-Oct-07 14:56 UTC

btrfsck --repair --init-extent-tree: segfault error 4

Any clues or educated comment please?

Can the corrupt directory tree safely be ignored and left in place? Or
might that cause everything to fall over in a big heap as soon as I try
to write data again?


Could any of these other tricks work around or fix the corrupt tree:

Run a scrub?

Make a snapshot and work from the snapshot?

Or try "mount -o recovery,noatime" again?


Or is it dead?

(The 1.5TB of backup data is replicated elsewhere but it would be good
to rescue this version rather than completely redo from scratch.
Especially so for the sake of just a few MBytes of one corrupt directory
tree.)

Thanks,
Martin




Chris Murphy

2013-Oct-07 19:03 UTC

Re: btrfsck --repair --init-extent-tree: segfault error 4

On Oct 7, 2013, at 8:56 AM, Martin <m_btrfs@ml1.co.uk> wrote:
> 
> Or try "mount -o recovery,noatime" again?
Because of this: free space inode generation (0) did not match free space cache
generation (1607)

Try mount option clear_cache. You could then use iotop to make sure the
btrfs-freespace process becomes inactive before unmounting the file system; I
don't think you need to wait in order to use the file system, nor do you need
to unmount then remount without the option. But if it works, it should only be
needed once, not as a persistent mount option.

> Or is it dead?
> 
> (The 1.5TB of backup data is replicated elsewhere but it would be good
> to rescue this version rather than completely redo from scratch.
> Especially so for the sake of just a few MBytes of one corrupt directory
> tree.)
Right. If you snapshot the subvolume containing the corrupt portion of the file
system, the snapshot probably inherits that corruption. But if you write to only
one of them, any writes that make the problem worse should stay isolated to
the one you write to. I might avoid writing to it, honestly. To save time, get
increasingly aggressive to get data out of this directory and once you succeed,
blow away the file system and start from scratch.

You could also then try kernel 3.12 rc4, as there are some btrfs bug fixes
I'm seeing in there also, but I don't know if any of them will help your
case. If you try it, mount normally, then try to get your data. If that
doesn't work, try the recovery option. Maybe you'll get different results.

Chris Murphy

Martin

2013-Oct-09 16:03 UTC

Re: btrfsck --repair --init-extent-tree: segfault error 4

In summary:

Looks like minimal damage remains and yet I'm still suffering
"Input/output error" from btrfs and btrfsck appears to have looped...

A diff check suggests the damage to be in one (heavily linked to) tree
of a few MBytes.

Would a scrub clear out the damaged trees?


Worth debugging?

Thanks,
Martin


Further detail:


On 07/10/13 20:03, Chris Murphy wrote:
> 
> On Oct 7, 2013, at 8:56 AM, Martin <m_btrfs@ml1.co.uk> wrote:
> 
>> 
>> Or try "mount -o recovery,noatime" again?
> 
> Because of this: free space inode generation (0) did not match free
> space cache generation (1607)
> 
> Try mount option clear_cache. You could then use iotop to make sure
> the btrfs-freespace process becomes inactive before unmounting the
> file system; I don't think you need to wait in order to use the file
> system, nor do you need to unmount then remount without the option.
> But if it works, it should only be needed once, not as a persistent
> mount option.

Thanks for that.

So, trying:

mount -v -t btrfs -o recovery,noatime,clear_cache /dev/sdc

gave:

kernel: device label bu_A devid 1 transid 17448 /dev/sdc
kernel: btrfs: enabling inode map caching
kernel: btrfs: enabling auto recovery
kernel: btrfs: force clearing of disk cache
kernel: btrfs: disk space caching is enabled
kernel: btrfs: bdev /dev/sdc errs: wr 0, rd 27, flush 0, corrupt 0, gen 0
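
The last line of that mount output reports the per-device error counters that btrfs keeps, here 27 read errors. A minimal sketch that picks the non-zero counters out of such a line, assuming the "errs: wr N, rd N, ..." layout shown above; the function name is made up for illustration:

```shell
# Print the non-zero counters from "bdev ... errs:" kernel lines on stdin.
nonzero_dev_errors() {
    awk '/ errs: / {
        sub(/.* errs: /, "")            # keep only "wr 0, rd 27, ..."
        n = split($0, pair, /, */)      # one "name value" pair per element
        for (i = 1; i <= n; i++) {
            split(pair[i], kv, " ")
            if (kv[2] + 0 > 0) print kv[1] "=" kv[2]
        }
    }'
}

# Usage, assuming syslog is written to /var/log/messages:
#   grep 'bdev /dev/sdc errs' /var/log/messages | nonzero_dev_errors
```

The same counters can be read from a mounted filesystem with "btrfs device stats".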


btrfs-freespace appeared briefly in atop now and then, but there's no
noticeable disk activity. All very rapidly done?

Running a diff check to see if all ok and what might be missing gave the
syslog output:

kernel: verify_parent_transid: 165 callbacks suppressed
kernel: parent transid verify failed on 915444506624 wanted 16974 found
13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found
13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found
13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found
13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found
13021
kernel: parent transid verify failed on 915444506624 wanted 16974 found
13021


The diff eventually failed with "Input/output error".

'mv' to move this failed directory tree out of the way worked.
Attempting to use 'ln -s' gave the attached syslog output and the
filesystem was made "Read-only".

Remounting:

mount -v -o remount,recovery,noatime,clear_cache,rw /dev/sdc

and the mv looks fine. Trying the 'ln -s' again gives:

ln: creating symbolic link `./portage': Read-only file system

Unmounting gave the syslog message:

kernel: btrfs: commit super ret -30


Mounting again:

mount -v -t btrfs -o recovery,noatime,clear_cache /dev/sdc

showed that the symbolic link was put in place ok.

Rerunning the diff check eventually found another "Input/output error".


So unmounted and tried again:

btrfsck --repair --init-extent-tree /dev/sdc

Failed with:

btrfs unable to find ref byte nr 911367733248 parent 0 root 1  owner 2
offset 0
btrfs unable to find ref byte nr 911367737344 parent 0 root 1  owner 1
offset 1
btrfs unable to find ref byte nr 911367741440 parent 0 root 1  owner 0
offset 1
leaf free space ret -297791851, leaf data size 3995, used 297795846
nritems 2
checking extents
btrfsck: extent_io.c:606: free_extent_buffer: Assertion `!(eb->refs <
0)' failed.
enabling repair mode
Checking filesystem on /dev/sdc
UUID: 38a60270-f9c6-4ed4-8421-4bf1253ae0b3
Creating a new extent tree
Failed to find [911367733248, 168, 4096]
Failed to find [911367737344, 168, 4096]
Failed to find [911367741440, 168, 4096]



Rerunning again and this time btrfsck is sat there at 100% CPU for the
last 24 hours. Full output so far is:

parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
parent transid verify failed on 911904604160 wanted 17448 found 17449
Ignoring transid failure


Nothing in syslog and no disk activity.

Looped?...



>> Or is it dead?
>> 
>> (The 1.5TB of backup data is replicated elsewhere but it would be
>> good to rescue this version rather than completely redo from
>> scratch. Especially so for the sake of just a few MBytes of one
>> corrupt directory tree.)
> 
> Right. If you snapshot the subvolume containing the corrupt portion
> of the file system, the snapshot probably inherits that corruption.
> But if you write to only one of them, if those writes make the
> problem worse, should be isolated only to the one you write to. I
> might avoid writing to it, honestly. To save time, get increasingly
> aggressive to get data out of this directory and once you succeed,
> blow away the file system and start from scratch.
> 
> You could also then try kernel 3.12 rc4, as there are some btrfs bug
> fixes I'm seeing in there also, but I don't know if any of them will
> help your case. If you try it, mount normally, then try to get your
> data. If that doesn't work, try the recovery option. Maybe you'll get
> different results.
As suspected, thanks.

Would a scrub clear out the damaged trees?


Anything useful to try? Any debug value in looking at the fail cases?

Is there a btrfsck mode that makes good everything that is certain and
dumps any remaining fragments into "lost + found"? (Or is that still a
long way down the development road?)


Aside: btrfs looks to be usable enough, especially so with the disk
format now stable, to at least offer the well established features as
'stable'...?

(This is the first fail I've had, and considering the sata failed, is
no surprise... Too severe a test! But can the limited damage be
recovered...?)


Thanks,
Martin
