Steven Post
2013-Aug-31 10:12 UTC
Device delete returns "unable to go below four devices on raid10" on 5 drive setup
Hello list,

I have a 5 drive raid10 setup (the 6th SATA port malfunctions; all drives are 3TB in size). I want to remove a single drive, yet the 'btrfs device delete' command gives me the "unable to go below four devices on raid10" error.

This happened after I first deleted a device: when I checked afterwards, the device did not seem to have been removed, and issuing the command again results in the error.

Before:

# btrfs filesystem show /dev/sda3
Label: 'maindrivearray' uuid: f58976ab-2ce1-4a1c-bc82-22df7d3393b4
Total devices 5 FS bytes used 2.67TB
devid 4 size 2.73TB used 1.09TB path /dev/sde3
devid 3 size 2.73TB used 1.09TB path /dev/sdd3
devid 2 size 2.73TB used 1.09TB path /dev/sdc3
devid 6 size 2.73TB used 1.09TB path /dev/sdb3
devid 5 size 2.73TB used 1.09TB path /dev/sda3

Btrfs Btrfs v0.19

After issuing the command

# btrfs device delete /dev/sde3 /mnt

I get this:

# btrfs filesystem show /dev/sda3
Label: 'maindrivearray' uuid: f58976ab-2ce1-4a1c-bc82-22df7d3393b4
Total devices 5 FS bytes used 2.67TB
devid 4 size 2.73TB used 1.09TB path /dev/sde3
devid 3 size 2.73TB used 1.24TB path /dev/sdd3
devid 2 size 2.73TB used 1.24TB path /dev/sdc3
devid 6 size 2.73TB used 1.24TB path /dev/sdb3
devid 5 size 2.73TB used 1.24TB path /dev/sda3

Btrfs Btrfs v0.19

When I issue the delete command again, the error pops up, also after a reboot. The first remove did take a long time to complete, and according to syslog and the 'filesystem show' command a lot of data was moved to the other drives (as expected).

The system is running Debian Wheezy (kernel 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64).

Is this something known (and possibly resolved in a later version), or should I open a bug report about it? Could it be that the device removal was completed, but the device still shows as part of the array for some reason?

The reason for the remove is that I want to (gradually) replace the 3TB drives with 1TB ones, and somewhere in the middle move some of the data on the array to another machine, which currently has the 1TB drives that I intend to replace with the 3TB ones.

Best regards,
Steven
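To make the before/after comparison in the message above easier to see, here is a minimal sketch that parses the two 'filesystem show' listings and reports how much the "used" figure changed per device. The helper names (parse_used, usage_delta) and the regular expression are illustrative assumptions written for the output format quoted in this thread; they are not part of btrfs-progs, and other versions of the tool may print different units or layouts.

```python
import re

# Assumption: device lines look like the ones quoted in this thread, e.g.
# "devid 4 size 2.73TB used 1.09TB path /dev/sde3" (sizes always in TB).
DEVICE_LINE = re.compile(
    r"devid\s+(?P<devid>\d+)\s+size\s+(?P<size>[\d.]+)TB\s+"
    r"used\s+(?P<used>[\d.]+)TB\s+path\s+(?P<path>\S+)"
)

BEFORE = """
devid 4 size 2.73TB used 1.09TB path /dev/sde3
devid 3 size 2.73TB used 1.09TB path /dev/sdd3
devid 2 size 2.73TB used 1.09TB path /dev/sdc3
devid 6 size 2.73TB used 1.09TB path /dev/sdb3
devid 5 size 2.73TB used 1.09TB path /dev/sda3
"""

AFTER = """
devid 4 size 2.73TB used 1.09TB path /dev/sde3
devid 3 size 2.73TB used 1.24TB path /dev/sdd3
devid 2 size 2.73TB used 1.24TB path /dev/sdc3
devid 6 size 2.73TB used 1.24TB path /dev/sdb3
devid 5 size 2.73TB used 1.24TB path /dev/sda3
"""


def parse_used(show_output):
    """Return {device path: used TB} for every devid entry found."""
    return {
        m.group("path"): float(m.group("used"))
        for m in DEVICE_LINE.finditer(show_output)
    }


def usage_delta(before, after):
    """Return {path: change in used TB}; None if the device is gone afterwards."""
    return {
        path: (round(after[path] - used, 2) if path in after else None)
        for path, used in before.items()
    }


if __name__ == "__main__":
    deltas = usage_delta(parse_used(BEFORE), parse_used(AFTER))
    for path, delta in sorted(deltas.items()):
        print(path, "removed from listing" if delta is None else f"{delta:+.2f} TB")
```

With the listings from this thread, /dev/sde3 is the only device whose "used" figure does not change, while the four remaining devices each grow from 1.09TB to 1.24TB. That matches the observation that data was moved off the device even though it is still listed.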
Duncan
2013-Aug-31 11:41 UTC
Re: Device delete returns "unable to go below four devices on raid10" on 5 drive setup
Steven Post posted on Sat, 31 Aug 2013 12:12:55 +0200 as excerpted:

> Btrfs Btrfs v0.19

> The system is running Debian Wheezy (kernel 3.2.0-4-amd64 #1 SMP Debian
> 3.2.46-1 x86_64).
>
> Is this something known (and possibly resolved in a later version), or
> should I open a bug report about it? Could it be that the device removal
> was completed, but still shows as part of the array for some reason?

As a sysadmin running btrfs, not a dev, I don't know about your specific issue, but in general, be aware that btrfs (both the kernelspace filesystem and the userspace btrfs-tools) is still experimental and under heavy development. As such, it's strongly recommended to run the latest stable series kernel at the oldest, if not the rcs (here, I run a git kernel, but don't normally update to the new development series until rc2 or 3, which will hopefully avoid any serious data-destroying bugs, but of course on a development filesystem backups are even more important than normal in any case, so...), unless you have a specific known problem preventing you from doing so.

That would be 3.10.x, with 3.11 soon to be out. Some people do run one behind that, so the 3.9 series, but for anything older than that, trying something that isn't as ancient as the hills is generally a strong first recommendation, both because many known bugs have been fixed, so you're actually taking a bigger risk with older kernels, and because the bug reports simply aren't as useful that far back.

Similarly, the git master branch of btrfs-tools is deliberately kept stable and usable -- development happens on other branches and is merged -- so a live-git build not older than a couple months (that being about the length of a kernel cycle so they'll be of similar age) is recommended. 0.19 is a very old release, though it's the latest actual release, but there's a much newer 0.20-rc1 tagged if you still don't feel comfortable with a live-git build.

All this and more is covered on the btrfs wiki, found at https://btrfs.wiki.kernel.org/

If you wish to continue testing btrfs I'd suggest you read up a bit there, as if you didn't know the above, there's surely a lot else covered there that you're not aware of -- and on a development filesystem that lack of knowledge could well bite you!

Alternatively, testing a development filesystem certainly isn't for everybody, and the fact that you're running an old 3.2 kernel could be a hint that you're looking for something a bit more conservative and stable than a development filesystem, making it a poor fit for your needs at best. But that's for you to decide. If you're happy being a tester and either have your data well backed up or otherwise consider it losable in testing if things go wrong, go for it, but do it right, with current kernel and tools so your tests at least have some value if things /do/ go wrong! =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Chris Murphy
2013-Aug-31 17:42 UTC
Re: Device delete returns "unable to go below four devices on raid10" on 5 drive setup
On Aug 31, 2013, at 4:12 AM, Steven Post <redalert.commander@gmail.com> wrote:

>
> The system is running Debian Wheezy (kernel 3.2.0-4-amd64 #1 SMP Debian
> 3.2.46-1 x86_64).
>
> Is this something known (and possibly resolved in a later version), or
> should I open a bug report about it?

Try 3.10 or 3.11 before filing a bug on it.

> Could it be that the device removal
> was completed, but still shows as part of the array for some reason?

Yes. It might take a few minutes after the chunks are reallocated for the device to be removed from the volume. I've had some cases where even a reboot was needed for the information in fi sh to refresh.

> The reason for the remove is actually that I want to (gradually) replace
> the 3TB drives with 1 TB ones, and somewhere in the middle move some of
> the data of the array, to another machine, that currently has the 1 TB
> drives which I intend to replace with the 3TB ones.

Use a newer kernel for sure. What you suggest should work. If you're testing to see if it does work, and you're prepared for it not working (i.e. totally losing the entire file system) and prepared to find a consistent reproducer if it doesn't work, then have at it.

Otherwise, create a whole new btrfs volume with recent kernel and btrfs-progs on the other machine; and then rsync everything from old to new. Rsync has a checksum option, it will take longer, but you can then be reasonably assured of file integrity.

Chris Murphy
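The checksum option mentioned above is rsync's -c / --checksum flag. As a rough illustration of the kind of integrity check this buys you, here is a minimal sketch that compares two directory trees by content hash after a copy. It is not how rsync works internally; the function names, the choice of SHA-256, and the idea of running it over two mounted trees are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(old_root, new_root):
    """Return (files missing from new_root, files whose content differs)."""
    old = tree_digests(old_root)
    new = tree_digests(new_root)
    missing = sorted(set(old) - set(new))
    mismatched = sorted(
        name for name in old.keys() & new.keys() if old[name] != new[name]
    )
    return missing, mismatched
```

Two empty lists from compare_trees(old_root, new_root) would mean every file under the old tree exists under the new one with identical content. Like rsync with checksums enabled, this reads every byte on both sides, so it is slow on multi-terabyte volumes.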
Hugo Mills
2013-Aug-31 22:20 UTC
Re: Device delete returns "unable to go below four devices on raid10" on 5 drive setup
On Sat, Aug 31, 2013 at 11:42:28AM -0600, Chris Murphy wrote:

>
> On Aug 31, 2013, at 4:12 AM, Steven Post <redalert.commander@gmail.com> wrote:
> >
> > The system is running Debian Wheezy (kernel 3.2.0-4-amd64 #1 SMP Debian
> > 3.2.46-1 x86_64).
> >
> > Is this something known (and possibly resolved in a later version), or
> > should I open a bug report about it?
>
> Try 3.10 or 3.11 before filing a bug on it.

If you want a debian-packaged kernel, they're available from the experimental distribution.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Great oxymorons of the world, no. 5: Manifesto Promise ---
Steven Post
2013-Aug-31 23:55 UTC
Re: Device delete returns "unable to go below four devices on raid10" on 5 drive setup
On Sat, 2013-08-31 at 11:42 -0600, Chris Murphy wrote:

> On Aug 31, 2013, at 4:12 AM, Steven Post <redalert.commander@gmail.com> wrote:
> >
> > The system is running Debian Wheezy (kernel 3.2.0-4-amd64 #1 SMP Debian
> > 3.2.46-1 x86_64).
> >
> > Is this something known (and possibly resolved in a later version), or
> > should I open a bug report about it?
>
> Try 3.10 or 3.11 before filing a bug on it.

I don't intend to upgrade the first machine at this point, but I'll see if I can reproduce this on the second machine, which is running Debian Testing (Jessie); that one has a 3.10.7 kernel.

Hugo Mills suggested using a kernel from experimental, but I don't feel comfortable running that at this point, as that would be a 3.11-rc4 kernel. I might consider it if the 3.11 release became available in 'unstable' (I understand that Linus might release 3.11 this weekend). I might also consider running the 3.10 kernel from backports on the first machine if that would be necessary for some reason, but we'll see.

>
> > Could it be that the device removal
> > was completed, but still shows as part of the array for some reason?
>
> Yes. It might take a few minutes after the chunks are reallocated for the device to be removed from the volume. I've had some cases where even a reboot was needed for the information in fi sh to refresh.

I see, so that might be normal behaviour. Although we're several hours later now, and there has been a reboot since the first occurrence of the "unable to go below four drives" error. I did start a balance operation after the reboot; we'll see what that gives.

Once that completes, I intend to try removing the device again with the 'device delete' command. If that still gives the error, I'll just remove the drive from the machine and go from there.

> >
> > The reason for the remove is actually that I want to (gradually) replace
> > the 3TB drives with 1 TB ones, and somewhere in the middle move some of
> > the data of the array, to another machine, that currently has the 1 TB
> > drives which I intend to replace with the 3TB ones.
>
> Use a newer kernel for sure. What you suggest should work. If you're testing to see if it does work, and you're prepared for it not working (i.e. totally losing the entire file system) and prepared to find a consistent reproducer if it doesn't work, then have at it.
>
> Otherwise, create a whole new btrfs volume with recent kernel and btrfs-progs on the other machine; and then rsync everything from old to new. Rsync has a checksum option, it will take longer, but you can then be reasonably assured of file integrity.

The plan was to switch 2 or 3 of the 3TB drives with 1TB drives, then move data using sftp (scp), and then switch the remaining drives, all this time keeping the raid10 configuration. The exception was the first switch on machine 2: I didn't have the capacity to remove a single drive, so I had to mount degraded.

As I was handling the second machine (3.10.7 kernel), the filesystem suddenly became read-only during a 'device delete missing' operation (after I had already added a new 3TB device), with a warning in /var/log/syslog. I'll add the Call Trace from the log at the end of this message for reference. After remounting (with -o degraded again) I issued a balance, which completed successfully; then the device delete command returned immediately and the filesystem seemed alright, with no sign of data loss or corruption.

As an aside, I'd rather not recreate the arrays if it can be done without recreating.

On the other hand, we're not talking about a mission critical system. I wouldn't use btrfs for such a system at this point, but for home use (with backups) or testing, things seem to be in good shape.

>
>
> Chris Murphy

Thanks to all who replied for your responses.

Best regards,
Steven

PS: I forgot to mention it in my first mail, but please CC me, I'm not subscribed to the list. I'll try to check the archives to see if I missed anything though. I see I missed 1 reply on the list, while 1 reply was sent to me directly, and a third didn't even hit the list archives (yet?) at spinics.net.

PPS: sorry if I seem to be rambling on a bit about everything in a non-structured e-mail message.

/var/log/syslog (3.10-2-amd64 #1 SMP Debian 3.10.7-1 (2013-08-17) x86_64):

[16431.789463] btrfs: relocating block group 1573890818048 flags 65
[16456.635819] btrfs: found 3392 extents
[16459.691201] BTRFS error (device sdb) in btrfs_commit_transaction:1809: errno=-5 IO failure (Error while writing out transaction)
[16459.691207] BTRFS info (device sdb): forced readonly
[16459.691210] BTRFS warning (device sdb): Skipping commit of aborted transaction.
[16459.691212] ------------[ cut here ]------------
[16459.691252] WARNING: at /build/linux-kDQkfE/linux-3.10.7/fs/btrfs/super.c:254 __btrfs_abort_transaction+0x4a/0xbe [btrfs]()
[16459.691253] btrfs: Transaction aborted (error -5)
[16459.691254] Modules linked in: rpcsec_gss_krb5 nfsv4 nfnetlink_queue nfnetlink nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_tcpudp ip6table_filter ip6_tables ebtable_nat ebtables iptable_filter ip_tables xt_iprange xt_state nf_conntrack ipt_REJECT xt_mark xt_NFQUEUE x_tables parport_pc ppdev lp parport bnep rfcomm bluetooth snd_hrtimer pci_stub vboxpci(O) vboxnetadp(O) cpufreq_userspace cpufreq_conservative cpufreq_powersave cpufreq_stats vboxnetflt(O) vboxdrv(O) binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd dns_resolver fscache sunrpc loop fuse joydev adt7475 hwmon_vid snd_hda_codec_realtek snd_hda_intel coretemp snd_hda_codec snd_hwdep kvm_intel snd_pcm_oss snd_mixer_oss kvm snd_pcm snd_page_alloc crc32c_intel snd_seq_midi snd_seq_midi_event ghash_clmulni_intel snd_rawmidi snd_seq eeepc_wmi iTCO_wdt asus_wmi iTCO_vendor_support sparse_keymap rfkill evdev aesni_intel snd_seq_device snd_timer aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper microcode pcspkr snd nouveau psmouse serio_raw i2c_i801 mxm_wmi lpc_ich video mfd_core ttm drm_kms_helper drm mperf i2c_algo_bit i2c_core soundcore wmi mei_me processor button mei thermal_sys ext4 crc16 jbd2 mbcache btrfs xor zlib_deflate raid6_pq crc32c libcrc32c dm_mod hid_generic md_mod usbhid hid sg sd_mod crc_t10dif ata_generic xhci_hcd ehci_pci ehci_hcd pata_via ata_piix ahci libahci usbcore usb_common libata r8169 mii scsi_mod
[16459.691308] CPU: 0 PID: 5381 Comm: btrfs Tainted: G O 3.10-2-amd64 #1 Debian 3.10.7-1
[16459.691309] Hardware name: System manufacturer System Product Name/P8H67, BIOS 1103 08/12/2011
[16459.691311] 0000000000000000 ffffffff8103bb5f ffff8801f75039f0 00000000fffffffb
[16459.691313] ffff8801f7503a40 ffff88005ed243b0 ffffffffa01f5500 ffffffff8103bc0a
[16459.691315] ffffffffa01f7288 0000000000000020 ffff8801f7503a50 ffff8801f7503a10
[16459.691317] Call Trace:
[16459.691323] [<ffffffff8103bb5f>] ? warn_slowpath_common+0x5b/0x70
[16459.691326] [<ffffffff8103bc0a>] ? warn_slowpath_fmt+0x47/0x49
[16459.691334] [<ffffffffa017d657>] ? __btrfs_abort_transaction+0x4a/0xbe [btrfs]
[16459.691344] [<ffffffffa019dbbe>] ? cleanup_transaction+0x84/0x24f [btrfs]
[16459.691347] [<ffffffff81057c67>] ? abort_exclusive_wait+0x79/0x79
[16459.691357] [<ffffffffa019d870>] ? btrfs_commit_transaction+0x866/0x878 [btrfs]
[16459.691359] [<ffffffff81057c67>] ? abort_exclusive_wait+0x79/0x79
[16459.691368] [<ffffffffa019e0ae>] ? start_transaction+0x325/0x448 [btrfs]
[16459.691371] [<ffffffff8105f669>] ? should_resched+0x5/0x23
[16459.691374] [<ffffffff81384167>] ? mutex_lock+0xa/0x27
[16459.691384] [<ffffffffa01d3988>] ? prepare_to_relocate+0xc2/0xd0 [btrfs]
[16459.691395] [<ffffffffa01d7d45>] ? relocate_block_group+0x3d/0x4db [btrfs]
[16459.691404] [<ffffffffa01d8327>] ? btrfs_relocate_block_group+0x144/0x268 [btrfs]
[16459.691415] [<ffffffffa01b9c23>] ? btrfs_relocate_chunk.isra.59+0x50/0x3f6 [btrfs]
[16459.691421] [<ffffffffa017e0eb>] ? btrfs_item_key_to_cpu+0x12/0x30 [btrfs]
[16459.691432] [<ffffffffa01af0fc>] ? btrfs_get_token_64+0x76/0xc6 [btrfs]
[16459.691442] [<ffffffffa01b19a1>] ? release_extent_buffer+0x90/0x97 [btrfs]
[16459.691452] [<ffffffffa01bbea0>] ? btrfs_shrink_device+0x1f8/0x35e [btrfs]
[16459.691462] [<ffffffffa01be84b>] ? btrfs_rm_device+0x2b8/0x690 [btrfs]
[16459.691472] [<ffffffffa01c49ed>] ? btrfs_ioctl+0x8ee/0x197d [btrfs]
[16459.691474] [<ffffffff810dee28>] ? handle_mm_fault+0x1f1/0x238
[16459.691476] [<ffffffff81388c33>] ? __do_page_fault+0x32d/0x3cb
[16459.691479] [<ffffffff81115f74>] ? vfs_ioctl+0x1b/0x25
[16459.691480] [<ffffffff81116795>] ? do_vfs_ioctl+0x3e8/0x42a
[16459.691482] [<ffffffff81116825>] ? SyS_ioctl+0x4e/0x79
[16459.691484] [<ffffffff8138ade9>] ? system_call_fastpath+0x16/0x1b
[16459.691485] ---[ end trace 92cca53f6fe2bc37 ]---
[16459.691487] BTRFS error (device sdb) in cleanup_transaction:1449: errno=-5 IO failure
[16459.691488] delayed_refs has NO entry
Chris Murphy
2013-Sep-01 00:03 UTC
Re: Device delete returns "unable to go below four devices on raid10" on 5 drive setup
On Aug 31, 2013, at 5:55 PM, Steven Post <redalert.commander@gmail.com> wrote:

> On Sat, 2013-08-31 at 11:42 -0600, Chris Murphy wrote:
>>
>> Yes. It might take a few minutes after the chunks are reallocated for the device to be removed from the volume. I've had some cases where even a reboot was needed for the information in fi sh to refresh.
>
> I see, so that might be normal behaviour.

No, I think it's expected for deleted devices to not appear in the volume listing anymore, but with older kernels I had that experience. I haven't tried it recently with newer kernels.

>
> As an aside, I'd rather not recreate the arrays if it can be done
> without recreating.

It should work. But it's an experimental file system. I'd at least make a backup if you're going to do this with the device add/remove method.

Chris Murphy
Steven Post
2013-Sep-01 12:08 UTC
Re: Device delete returns "unable to go below four devices on raid10" on 5 drive setup
On Sat, 2013-08-31 at 18:03 -0600, Chris Murphy wrote:

> On Aug 31, 2013, at 5:55 PM, Steven Post <redalert.commander@gmail.com> wrote:
>
> > On Sat, 2013-08-31 at 11:42 -0600, Chris Murphy wrote:
> >>
> >> Yes. It might take a few minutes after the chunks are reallocated for the device to be removed from the volume. I've had some cases where even a reboot was needed for the information in fi sh to refresh.
> >
> > I see, so that might be normal behaviour.
>
> No, I think it's expected for deleted devices to not appear in the volume listing anymore, but with older kernels I had that experience. I haven't tried it recently with newer kernels.

I might have phrased that a bit incorrectly; with 'normal behaviour' I meant it was known to do that and not cause major problems. Of course I would expect the entry to just disappear.

I'll let you know the result of the 'device delete' operation on the second machine (3.10.7 kernel).

Back on the 3.2 kernel, the "filesystem show" command still showed the removed device after the balance, though still with less used space than the others. Shutdown + physical removal + boot didn't produce any error; since this array is used as the root filesystem, I think I would have noticed a serious problem by now. I successfully added the 1 TB drive to the array (after partitioning).

So it seems it's just the output of the 'filesystem show' command that is lagging behind, even after a reboot and a 20 hour idle period.

> >
> > As an aside, I'd rather not recreate the arrays if it can be done
> > without recreating.
>
> It should work. But it's an experimental file system. I'd at least make a backup if you're going to do this with the device add/remove method.

Naturally, all important files have a backup, but for the rest of the volume I can live with the fact that it would be lost. Also, if everyone created a new filesystem instead of using device add/delete, this code wouldn't get much testing ;)

Best regards,
Steven
Steven Post
2013-Sep-01 21:43 UTC
Re: Device delete returns "unable to go below four devices on raid10" on 5 drive setup
On Sun, 2013-09-01 at 14:08 +0200, Steven Post wrote:

> On Sat, 2013-08-31 at 18:03 -0600, Chris Murphy wrote:
> > On Aug 31, 2013, at 5:55 PM, Steven Post <redalert.commander@gmail.com> wrote:
> >
> > > On Sat, 2013-08-31 at 11:42 -0600, Chris Murphy wrote:
> > >>
> > >> Yes. It might take a few minutes after the chunks are reallocated for the device to be removed from the volume. I've had some cases where even a reboot was needed for the information in fi sh to refresh.
> > >
> > > I see, so that might be normal behaviour.
> >
> > No, I think it's expected for deleted devices to not appear in the volume listing anymore, but with older kernels I had that experience. I haven't tried it recently with newer kernels.
>
> I might have phrased that a bit incorrectly, with 'normal behaviour' I
> meant it was known to do that and not cause major problems. Of course I
> would expect the entry to just disappear.
> I'll let you know the result of the 'device delete' operation on the
> second machine (3.10.7 kernel).

The 3.10.7 kernel doesn't seem to have this issue; the removed device is no longer listed immediately after the 'device delete' command returns. Although in that instance the array still had 6 drives, not 5 as with the old one.

[...]

Best regards,
Steven
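One way to read the thread as a whole is as a simple device-count check. The sketch below models that reading and nothing more: the minimum of four comes from the error text itself, while the function name and the idea that the filesystem already counted four devices after the first (apparently completed) removal are assumptions that the thread does not confirm. It is not the kernel's actual code.

```python
# Assumption: the limit is taken from the error text
# "unable to go below four devices on raid10".
RAID10_MIN_DEVICES = 4


def can_delete_device(device_count, min_devices=RAID10_MIN_DEVICES):
    """Return True if removing one device still leaves at least min_devices."""
    return device_count - 1 >= min_devices


if __name__ == "__main__":
    # 5 drives, as listed before the first delete on the 3.2 machine.
    print("5 devices:", can_delete_device(5))
    # Hypothetical: the first delete completed internally, so 4 devices remain
    # even though 'filesystem show' still lists 5.
    print("4 devices:", can_delete_device(4))
    # 6 drives, as on the 3.10.7 machine.
    print("6 devices:", can_delete_device(6))
```

Under this model a first delete on 5 devices is allowed and a second one is refused, which would be consistent with a completed removal combined with a stale 'filesystem show' listing.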