thr3ads.net - CentOS - [CentOS] LVM failure after CentOS 7.6 upgrade -- possible corruption [Dec 2018]

Gordon Messmer

2018-Dec-05 17:27 UTC

[CentOS] LVM failure after CentOS 7.6 upgrade -- possible corruption

I've started updating systems to CentOS 7.6, and so far I have one failure.

This system has two peculiarities which might have triggered the 
problem.  The first is that one of the software RAID arrays on this 
system is degraded.  While troubleshooting the problem, I saw similar 
error messages mentioned in bug reports indicating that GNU/Linux 
systems would not boot with degraded software RAID arrays.  The other 
peculiar aspect is that the system uses dm-cache.

Logs from some of the early failed boots are not available, but before I 
completely fixed the problem, I was able to bring the system up once, 
and captured logs which look substantially similar to the initial boot. 
The content of /var/log/messages is here:
	https://paste.fedoraproject.org/paste/n-E6X76FWIKzIvzPOw97uw

The output of lsblk (minus some VM logical volumes) is here:
	https://paste.fedoraproject.org/paste/OizFvMeGn81vF52VEvUbyg

As best I can tell, the LVM tools were treating software RAID component 
devices as PVs, and detecting a conflict between those and the assembled 
RAID volume.  When running "pvs" on the broken system, no RAID volumes
were listed, only component devices.  At the moment, I don't know if the 
LVs that were activated by the initrd were backed by component devices 
or the RAID devices, so it's possible that this bug might corrupt 
software RAID arrays.

In order to correct the problem, I had to add a global_filter to 
/etc/lvm/lvm.conf and rebuild the initrd (dracut -f):
	global_filter = [ "r|vm_.*_data|", "a|sdd1|", "r|sd..|" ]

This filter excludes the LVs that contain VM data, accepts "/dev/sdd1"
which is the dm-cache device, and rejects all other partitions on 
SCSI(SATA) device nodes, as all of those are RAID component devices.
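
A rough Python sketch of how a filter list like this one is evaluated, going by the description in lvm.conf(5): each entry is "a" (accept) or "r" (reject) followed by a delimited regex, the first pattern that matches a path decides, and a path that matches nothing is accepted. The helper names, the LV path /dev/VolGroup/vm_web_data, and the use of Python's re module in place of LVM's own regex engine are assumptions made for illustration. Real LVM also checks every alias of a device, which this sketch ignores.

```python
import re

# The filter from the message above, as a list of lvm.conf-style entries.
GLOBAL_FILTER = ["r|vm_.*_data|", "a|sdd1|", "r|sd..|"]


def parse_filter(entries):
    """Turn 'a|regex|' / 'r|regex|' strings into (accept, compiled_regex) pairs."""
    rules = []
    for entry in entries:
        if len(entry) < 3 or entry[0] not in ("a", "r") or entry[-1] != entry[1]:
            raise ValueError("malformed filter entry: %r" % entry)
        rules.append((entry[0] == "a", re.compile(entry[2:-1])))
    return rules


def is_accepted(rules, path):
    """First matching pattern wins; a path that matches nothing is accepted."""
    for accept, pattern in rules:
        if pattern.search(path):
            return accept
    return True


if __name__ == "__main__":
    rules = parse_filter(GLOBAL_FILTER)
    # /dev/VolGroup/vm_web_data is a made-up LV name for illustration.
    for path in ("/dev/sdd1", "/dev/sda3", "/dev/md127", "/dev/VolGroup/vm_web_data"):
        print(path, "accepted" if is_accepted(rules, path) else "rejected")
```

Under these rules the assembled arrays (/dev/md127 and friends) match no pattern and are accepted by default. A whole-disk name such as /dev/sda is also accepted, because "sd.." needs two characters after "sd".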

I'm still working on the details of the problem, but I wanted to share 
what I know now in case anyone else might be affected.

After updating, look at the output of "pvs" if you use LVM on software
RAID.
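
For anyone who wants to automate that check, here is a minimal Python sketch that compares the PV names reported by "pvs" with the member devices listed in /proc/mdstat and flags any overlap. The function names are hypothetical and the mdstat sample is invented (the real array membership on this system is only in the lsblk paste). The pvs sample uses the device names reported later in this thread.

```python
import re

# Hypothetical /proc/mdstat contents; the array membership here is invented.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md127 : active raid1 sda3[0] sdb3[1]
      2930133824 blocks [2/2] [UU]
md2 : active raid1 sdc1[0]
      2930133824 blocks [2/1] [U_]
unused devices: <none>
"""

# `pvs` output of the kind seen on the broken system.
SAMPLE_PVS = """\
  PV         VG          Fmt  Attr PSize  PFree
  /dev/sda3  VolGroup    lvm2 a--  <2.73t <768.41g
  /dev/sdc1  BackupGroup lvm2 a--  <2.73t       0
"""


def md_components(mdstat_text):
    """Return the set of /dev paths listed as members of any md array."""
    members = set()
    for line in mdstat_text.splitlines():
        if re.match(r"md\S*\s*:", line):
            # Member tokens look like "sda3[0]", optionally followed by "(F)" or "(S)".
            for name in re.findall(r"([^\s\[]+)\[\d+\]", line):
                members.add("/dev/" + name)
    return members


def pv_names(pvs_text):
    """Return the first column of `pvs` output, skipping the header row."""
    names = []
    for line in pvs_text.splitlines():
        fields = line.split()
        if fields and fields[0] != "PV":
            names.append(fields[0])
    return names


def suspicious_pvs(pvs_text, mdstat_text):
    """PVs that are also md component devices, which LVM should have skipped."""
    members = md_components(mdstat_text)
    return [pv for pv in pv_names(pvs_text) if pv in members]


if __name__ == "__main__":
    print(suspicious_pvs(SAMPLE_PVS, SAMPLE_MDSTAT))
```

On a live system the two inputs could come from open("/proc/mdstat").read() and from running pvs as root through subprocess.run(["pvs"], capture_output=True, text=True).stdout. An empty result means no PV is a RAID component device.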

Simon Matter

2018-Dec-05 17:56 UTC

> After updating, look at the output of "pvs" if you use LVM on software
> RAID.
What exactly did `pvs' show and instead of what?

Regards,
Simon

Benjamin Smith

2018-Dec-05 19:27 UTC

My gut feeling is that this is related to a RAID1 issue I'm seeing with 7.6.
See email thread "CentOS 7.6: Software RAID1 fails the only meaningful test".

I suggest trying to boot from an earlier kernel. Good luck! 

Ben S 


Stephen John Smoogen

2018-Dec-05 19:38 UTC

On Wed, 5 Dec 2018 at 14:27, Benjamin Smith <lists at benjamindsmith.com> wrote:
> My gut feeling is that this is related to a RAID1 issue I'm seeing with 7.6.
> See email thread "CentOS 7.6: Software RAID1 fails the only meaningful test"
>
You might want to point out which list you posted it on since it
doesn't seem to be this one.

-- 
Stephen J Smoogen.

Gordon Messmer

2018-Dec-05 23:11 UTC

On 12/5/18 9:56 AM, Simon Matter wrote:
>> When running "pvs" on the broken system, no RAID volumes
>> were listed, only component devices....
>> After updating, look at the output of "pvs" if you use LVM on software
>> RAID.
> What exactly did `pvs' show and instead of what?
It should print:

# pvs
  PV         VG          Fmt  Attr PSize   PFree
  /dev/md127 VolGroup    lvm2 a--   <2.73t <768.41g
  /dev/md2   BackupGroup lvm2 a--   <2.73t        0
  /dev/sdd1  VolGroup    lvm2 a--  <55.88g        0

and IIRC, it printed:

# pvs
  PV         VG          Fmt  Attr PSize   PFree
  /dev/sda3  VolGroup    lvm2 a--   <2.73t <768.41g
  /dev/sdc1  BackupGroup lvm2 a--   <2.73t        0

Gordon Messmer

2018-Dec-06 04:34 UTC

On 12/5/18 9:27 AM, Gordon Messmer wrote:
> The content of /var/log/messages is here:
> 	https://paste.fedoraproject.org/paste/n-E6X76FWIKzIvzPOw97uw
I don't have much new information, other than that I tested booting a 
similar system with an intentionally degraded RAID volume.  That one 
booted properly, so I don't think that was the problem.  The dm-cache 
device still needs further investigation, but I'm going to wait for all 
RAID arrays to re-sync before further testing.

Going through the log again, I'm looking at this line:
Dec  4 21:17:34 ascension lvm: WARNING: Device mismatch detected for 
VolGroup/lv_root which is accessing /dev/md127 instead of /dev/sda3.

Since it says "is accessing /dev/md127", I think the kernel activated 
the LVs properly, in which case there shouldn't be any corruption risk.
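
The same reading of the warning can be applied across a whole log. The sketch below is a small Python helper that pulls the LV, the device it is actually using, and the device the LVM tools expected out of each "Device mismatch" line. The regex is derived only from the single line quoted above, and the function and field names are hypothetical, so other lvm2 versions may word the message differently.

```python
import re

# Pattern built from the one warning quoted in this message.
MISMATCH = re.compile(
    r"Device mismatch detected for (?P<lv>\S+) which is accessing "
    r"(?P<in_use>\S+) instead of (?P<expected>\S+?)\.?$"
)

# The log line from above, joined back onto one line.
SAMPLE_LINE = (
    "Dec  4 21:17:34 ascension lvm: WARNING: Device mismatch detected for "
    "VolGroup/lv_root which is accessing /dev/md127 instead of /dev/sda3."
)


def parse_mismatch(line):
    """Return a dict with 'lv', 'in_use' and 'expected', or None if no match."""
    match = MISMATCH.search(line.strip())
    return match.groupdict() if match else None


def backed_by_md(info):
    """True when the active LV sits on an assembled md device."""
    return info["in_use"].startswith("/dev/md")


if __name__ == "__main__":
    info = parse_mismatch(SAMPLE_LINE)
    print(info, backed_by_md(info))
```

Any line where "in_use" is a component device such as /dev/sda3 rather than an md device would be the case to worry about.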

I still can't reason why the lvm tools were scanning the component 
volumes to begin with.

Gordon Messmer

2018-Dec-06 05:57 UTC

On 12/5/18 8:34 PM, Gordon Messmer wrote:
> I still can't reason why the lvm tools were scanning the component 
> volumes to begin with.
I think I've figured it out.  The new lvm-tools package appears to have 
broken support for detecting md metadata version 0.90.  The update 
should be stable for anyone who did not upgrade from earlier versions of 
CentOS.  (I don't actually know when Anaconda last used 0.90 metadata.)

On a working system with "verbose = 6" in lvm.conf:

# mdadm --detail /dev/md/primary
/dev/md/primary:
            Version : 1.2
...
# pvs
...
#device/dev-io.c:609           Opened /dev/sda3 RO O_DIRECT
#device/dev-io.c:359         /dev/sda3: size is 1951133696 sectors
#device/dev-io.c:658           Closed /dev/sda3
#filters/filter-mpath.c:196           /dev/sda3: Device is a partition, using primary device sda for mpath component detection
#device/dev-io.c:336         /dev/sda3: using cached size 1951133696 sectors
#device/dev-md.c:163           Found md magic number at offset 4096 of /dev/sda3.
#filters/filter-md.c:108           /dev/sda3: Skipping md component device
...


On the broken system:

# mdadm --detail /dev/md/primary
/dev/md/primary:
            Version : 0.90
...
# pvs
...
#device/dev-io.c:609           Opened /dev/sda3 RO O_DIRECT
#device/dev-io.c:359         /dev/sda3: size is 5858142208 sectors
#device/dev-io.c:658           Closed /dev/sda3
#filters/filter-mpath.c:196           /dev/sda3: Device is a partition, using primary device sda for mpath component detection
#filters/filter-partitioned.c:30            filter partitioned deferred /dev/sda3
#filters/filter-md.c:99            filter md deferred /dev/sda3
#filters/filter-persistent.c:346           filter caching good /dev/sda3
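
To illustrate why the metadata version matters for detection, here is a hedged Python sketch of where the md magic number lives for the two versions seen above. It assumes the layout described in md(4): a 1.2 superblock starts 4 KiB into the device, while a 0.90 superblock sits in a 64 KiB-aligned block near the end of the device, so a scanner that only reads the start never sees it. The function names and the in-memory stand-in for a device are invented for the example. This is not how lvm2 itself is implemented.

```python
import io
import struct

MD_MAGIC = 0xA92B4EFC     # md superblock magic number
RESERVED = 64 * 1024      # 0.90 uses a 64 KiB-aligned block at the end
OFFSET_1_2 = 4096         # matches "Found md magic number at offset 4096" above


def offset_090(size_bytes):
    """Byte offset of a 0.90 superblock on a device of the given size."""
    return (size_bytes & ~(RESERVED - 1)) - RESERVED


def detect_md_version(dev, size_bytes):
    """Return '1.2', '0.90', or None depending on where the magic is found."""
    candidates = (("1.2", OFFSET_1_2), ("0.90", offset_090(size_bytes)))
    for version, offset in candidates:
        if offset < 0:
            continue
        dev.seek(offset)
        raw = dev.read(4)
        # 0.90 stores the magic in the creating host's byte order, so try both.
        if raw in (struct.pack("<I", MD_MAGIC), struct.pack(">I", MD_MAGIC)):
            return version
    return None


def fake_device(size_bytes, magic_offset):
    """In-memory stand-in for a block device with the magic at one offset."""
    image = bytearray(size_bytes)
    image[magic_offset:magic_offset + 4] = struct.pack("<I", MD_MAGIC)
    return io.BytesIO(bytes(image))


if __name__ == "__main__":
    size = 1024 * 1024  # a 1 MiB toy "device"
    print(detect_md_version(fake_device(size, OFFSET_1_2), size))
    print(detect_md_version(fake_device(size, offset_090(size)), size))
```

Against a real partition the same function could be given a file object opened read-only as root, with the size taken from the "size is N sectors" figure in the log multiplied by 512.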
