Received the following message in mail to root:

Message 257:
From root at desk4.localdomain  Tue Oct 28 07:25:37 2014
Return-Path: <root at desk4.localdomain>
X-Original-To: root
Delivered-To: root at desk4.localdomain
From: mdadm monitoring <root at desk4.localdomain>
To: root at desk4.localdomain
Subject: DegradedArray event on /dev/md0:desk4
Date: Tue, 28 Oct 2014 07:25:27 -0400 (EDT)
Status: RO

This is an automatically generated mail message from mdadm
running on desk4

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md0 : active raid1 dm-2[1]
      243682172 blocks super 1.1 [2/1] [_U]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 dm-3[0] dm-0[1]
      1953510268 blocks super 1.1 [2/2] [UU]
      bitmap: 3/15 pages [12KB], 65536KB chunk

unused devices: <none>

& q
Held 314 messages in /var/spool/mail/root
You have mail in /var/spool/mail/root

Ran an mdadm query against both RAID arrays:

[root at desk4 ~]# mdadm --query --detail /dev/md0
/dev/md0:
        Version : 1.1
  Creation Time : Thu Nov 15 19:24:17 2012
     Raid Level : raid1
     Array Size : 243682172 (232.39 GiB 249.53 GB)
  Used Dev Size : 243682172 (232.39 GiB 249.53 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Dec  2 20:02:55 2014
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : desk4.localdomain:0
           UUID : 29f70093:ae78cf9f:0ab7c1cd:e380f50b
         Events : 266241

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1     253        3        1      active sync   /dev/dm-3

[root at desk4 ~]# mdadm --query --detail /dev/md1
/dev/md1:
        Version : 1.1
  Creation Time : Thu Nov 15 19:24:19 2012
     Raid Level : raid1
     Array Size : 1953510268 (1863.01 GiB 2000.39 GB)
  Used Dev Size : 1953510268 (1863.01 GiB 2000.39 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Dec  2 20:06:21 2014
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : desk4.localdomain:1
           UUID : 1bef270d:36301a24:7b93c7a9:a2a95879
         Events : 108306

    Number   Major   Minor   RaidDevice State
       0     253        0        0      active sync   /dev/dm-0
       1     253        1        1      active sync   /dev/dm-1
[root at desk4 ~]#

Appears to me that device 0 (/dev/dm-2) on md0 has been removed because
of problems.

This is my first encounter with a raid failure. I suspect I should
replace disk 0 and let the raid rebuild itself.

Seeking guidance and a good source for the procedures.

Dave M
On 02/12/14 08:14 PM, David McGuffey wrote:
> Received the following message in mail to root:
>
> [...DegradedArray notification, /proc/mdstat and mdadm --detail output snipped...]
>
> Appears to me that device 0 (/dev/dm-2) on md0 has been removed because
> of problems.
>
> This is my first encounter with a raid failure. I suspect I should
> replace disk 0 and let the raid rebuild itself.
>
> Seeking guidance and a good source for the procedures.
>
> Dave M

In short, buy a replacement disk of equal or greater size, create
matching partitions, and then use mdadm to add the replacement partition
(of appropriate size) back into the array.

An example command to add a replacement partition would be:

mdadm --manage /dev/md0 --add /dev/sda1

I strongly recommend creating a virtual machine with a pair of virtual
disks and simulating the replacement of the drive before trying it out
on your real system. In any case, be sure to have good backups
(immediately).

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
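For reference, a minimal sketch of the sequence Digimer describes, assuming
an MBR-partitioned pair where the survivor is /dev/sdb and the new disk
arrives as /dev/sda (both names are assumptions, and any LUKS/LVM layer
sitting between the partition and the md member is ignored here):

  # Clone the partition layout from the surviving disk to the new one
  # (use sgdisk instead if the disks are GPT-labelled):
  sfdisk -d /dev/sdb | sfdisk /dev/sda

  # Add the matching partition back into the degraded array:
  mdadm --manage /dev/md0 --add /dev/sda1

  # Watch the rebuild progress:
  cat /proc/mdstat

Note that the write-intent bitmaps shown in /proc/mdstat only shorten the
resync when the original member comes back; a brand-new partition still
gets a full rebuild.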
On 2014-12-03, David McGuffey <davidmcguffey at verizion.net> wrote:
>
> Appears to me that device 0 (/dev/dm-2) on md0 has been removed because
> of problems.

That looks about right. There may be more error messages in your system
logs (e.g., /var/log/messages, dmesg), which might tell you more about the
nature of the failure.

> This is my first encounter with a raid failure. I suspect I should
> replace disk 0 and let the raid rebuild itself.
>
> Seeking guidance and a good source for the procedures.

The linux RAID wiki is often a good (though sometimes dated) resource:

https://raid.wiki.kernel.org/index.php/Reconstruction

If you wish to attempt a hot swap, make *sure* you pull the correct device!
If you're not sure, or not sure your system even supports it, it's safer to
power down to do the swap. You should still verify which drive has failed
before shutting down, though picking the wrong one is far less catastrophic
with the machine powered off. As long as you are careful, reconstructing a
degraded RAID is usually pretty straightforward.

--keith

-- 
kkeller at wombat.san-francisco.ca.us
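A sketch of how one might pin down which physical disk is actually bad
before pulling anything (the md members here are device-mapper devices, so
they need to be traced back to real drives first; /dev/sda below is just a
placeholder):

  # Show how the md and dm devices stack on top of the physical disks:
  lsblk

  # Look for the original kick-out and any ATA or I/O errors:
  grep -iE 'md0|raid1|ata|i/o error' /var/log/messages | less

  # Note the model and serial number of the suspect drive so the right
  # one gets pulled from the case:
  smartctl -i /dev/sda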
On Tue, Dec 02, 2014 at 08:14:19PM -0500, David McGuffey wrote:
> Received the following message in mail to root:
>
> [...DegradedArray notification and /proc/mdstat snipped...]

Could be a bad drive, as Digimer alludes in his reply.

OTOH, I had a perfectly good drive get kicked out of my RAID-1 array a few
years ago just because, well, I guess I could say "it felt like it". In
reality, I had (in my ignorance) purchased a pair of WD drives that aren't
intended to be used in a RAID array, and once in a long while (that was
actually the only such instance in the 4-5 years I've had that RAID array)
one of them doesn't respond to some HD command or other and gets dropped.
It turned out to be easy to reinsert it, and it ran for a long time
thereafter without trouble.

I can dig for the info on the drives and the nature of the problem if
anyone wants to see it.

Fred

-- 
---- Fred Smith -- fredex at fcshome.stoneham.ma.us -----------------------------
  The Lord is like a strong tower. Those who do what is right
  can run to him for safety.
--------------------------- Proverbs 18:10 (niv) ------------------------------
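For the archives, "reinserting" a member that was merely kicked out (as
opposed to replacing a dead disk) can be as simple as the following sketch;
/dev/sdb1 is a made-up member name, and with a write-intent bitmap in place
the --re-add usually only resyncs the blocks that changed while it was out:

  # Clear the stale, faulty record of the member, then put it back:
  mdadm /dev/md0 --remove /dev/sdb1
  mdadm /dev/md0 --re-add /dev/sdb1

  # Confirm it is resyncing (or already back in sync):
  cat /proc/mdstat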
On 12/2/2014 6:24 PM, Fred Smith wrote:
> In reality, I had (in my ignorance) purchased a pair of WD
> drives that aren't intended to be used in a RAID array, and
> once in a long while (that was actually the only such instance
> in the 4-5 years I've had that RAID array) one of them doesn't
> respond to some HD command or other and gets dropped.

Desktop-class SATA drives will report 'write successful' while there's
still data in their buffers, so the raid will happily continue; then, if
the drive actually hits an unrecoverable write error, things are toast and
the raid is out of sync.

This is a major reason I'm leaning towards using ZFS for future raids
(primarily via FreeBSD): ZFS checksums and timestamps every block it
writes. Where a regular raid can only say "something is wrong here, but
what it is I ain't exactly sure", ZFS can look at the two copies of a
block, decide "A is good, B is bad/stale, let's replicate A back to B",
and do exactly that as part of the zpool 'scrub' process.

-- 
john r pierce                                      37N 122W
somewhere on the middle of the left coast
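For comparison, a minimal sketch of that workflow on the ZFS side (pool and
disk names are invented; ada0/ada1 is just the usual FreeBSD naming):

  # Create a two-way mirror from two whole disks:
  zpool create tank mirror ada0 ada1

  # Walk every block, verify its checksum, and repair bad copies from the
  # good side of the mirror:
  zpool scrub tank
  zpool status tank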
On 12/03/2014 03:24 AM, Fred Smith wrote:
> OTOH, I had a perfectly good drive get kicked out of my RAID-1
> array a few years ago just because, well, I guess I could say
> "it felt like it".

I've seen that too, several times, on my home "server". Once in a while
(usually on one of the first days I'm on vacation) one of the drives stops
responding. A shutdown and cold restart is necessary to bring the drive
alive again; just a reboot won't fix it. After this, I rebuild the RAID
partitions and all is OK.

smartctl shows no sign of problems with the drive, so I suspect a
controller problem. This is on a desktop machine used as a server. I guess
this explains why we have server grade hardware :-)

Mogens

-- 
Mogens Kjaer, mk at lemo.dk
http://www.lemo.dk
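If anyone wants to run the same check on their own drives, something along
these lines (the drive name is a placeholder):

  # Overall SMART health verdict:
  smartctl -H /dev/sda

  # Full attribute dump; reallocated or pending sectors are the usual
  # early warning signs of a genuinely failing disk:
  smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'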
Hi David,

On 03.12.2014 at 02:14, David McGuffey <davidmcguffey at verizion.net> wrote:
> This is an automatically generated mail message from mdadm
> running on desk4
>
> A DegradedArray event had been detected on md device /dev/md0.
>
> Faithfully yours, etc.
>
> P.S. The /proc/mdstat file currently contains the following:
>
> Personalities : [raid1]
> md0 : active raid1 dm-2[1]
>       243682172 blocks super 1.1 [2/1] [_U]
>       bitmap: 2/2 pages [8KB], 65536KB chunk
>
> md1 : active raid1 dm-3[0] dm-0[1]
>       1953510268 blocks super 1.1 [2/2] [UU]
>       bitmap: 3/15 pages [12KB], 65536KB chunk

The reason why one drive was kicked out (the [_U] above) will be in
/var/log/messages. If the failing disk is also part of md1, then its
member should be manually removed from md1 before replacing the drive.

-- 
LF
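A sketch of what that pre-removal step could look like if a partition of
the dying disk turns out to still be active in md1 (the member names below
are hypothetical; on this system the members are dm-* devices, so check
lsblk or mdadm --detail for the real ones):

  # Find the original kick-out message for md0:
  grep -i md0 /var/log/messages

  # Fail and remove the corresponding member from md1 so the disk can be
  # pulled cleanly:
  mdadm /dev/md1 --fail /dev/sdb2
  mdadm /dev/md1 --remove /dev/sdb2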
Thanks for all the responses. A little more digging revealed:

md0 is made up of two 250G disks on which the OS and a very large /var
partition reside for a number of virtual machines. md1 is made up of two
2T disks on which /home resides.

The challenge is that disk 0 of md0 is the problem disk, and it has a 524M
/boot partition outside of the raid partition.

My plan is to back up /home (md1) and, at a minimum, /etc/libvirt and
/var/lib/libvirt (md0) before I do anything else.

Here are the log entries for 'raid':

Dec  1 20:50:15 desk4 kernel: md/raid1:md1: not clean -- starting background reconstruction
Dec  1 20:50:15 desk4 kernel: md/raid1:md1: active with 2 out of 2 mirrors
Dec  1 20:50:15 desk4 kernel: md/raid1:md0: active with 1 out of 2 mirrors

This is a desktop, not a server. We've had several short (<20 sec) power
outages over the last month; the last one was on 1 Dec. I suspect the
sudden loss and restoration of power could have trashed a portion of disk 0
in md0.

I finally obtained an APC UPS (BX1500G), installed, configured, and tested
it. In the future, it will carry me through these short outages.

I'll obtain a new 250G (or larger) drive and start rooting around for
guidance on how to replace a drive with the MBR and /boot on it.

On Wed, 2014-12-03 at 22:11 +0100, Leon Fauster wrote:
> The reason why one drive was kicked out (the [_U] above) will be in
> /var/log/messages. If the failing disk is also part of md1, then its
> member should be manually removed from md1 before replacing the drive.
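Since the failing disk also carries the MBR and a non-RAID /boot, the swap
needs a few steps beyond the plain mdadm --add. A rough sketch, assuming
the new disk appears as /dev/sda with sda1 destined for /boot and sda2 for
the md0 member (every device name here is an assumption, and any LUKS/LVM
layer behind the dm-* members would have to be recreated on sda2 before it
can join md0):

  # Partition the new disk: a ~512M /boot partition plus a partition at
  # least as large as the existing md0 member (interactive):
  fdisk /dev/sda

  # Recreate /boot on the new disk and copy the current contents over:
  mkfs.ext4 /dev/sda1
  mount /dev/sda1 /mnt
  cp -a /boot/. /mnt/
  umount /mnt

  # Make the new disk bootable on its own by reinstalling the boot
  # loader into its MBR:
  grub-install /dev/sda

  # Finally, add the big partition into the degraded array and let it
  # resync:
  mdadm --manage /dev/md0 --add /dev/sda2
  cat /proc/mdstat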