Hi, thank you for your reply. I'll continue inline...

On 09.09.2020 at 3:15, John Stoffel wrote:
> Miloslav> Hello,
> Miloslav> I sent this to the Linux Kernel Btrfs mailing list and I got the reply:
> Miloslav> "RAID-1 would be preferable"
> Miloslav> (https://lore.kernel.org/linux-btrfs/7b364356-7041-7d18-bd77-f60e0e2e2112 at lechevalier.se/T/).
> Miloslav> May I ask you for comments from people around Dovecot?
>
> Miloslav> We are using btrfs RAID-10 (/data, 4.7TB) on a physical Supermicro
> Miloslav> server with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 125GB of RAM.
> Miloslav> We run 'btrfs scrub start -B -d /data' every Sunday as a cron task. It
> Miloslav> takes about 50 minutes to finish.
>
> Miloslav> # uname -a
> Miloslav> Linux imap 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64
> Miloslav> GNU/Linux
>
> Miloslav> The RAID is composed of 16 hard drives. The drives are connected via
> Miloslav> an AVAGO MegaRAID SAS 9361-8i as RAID-0 devices. All drives are SAS
> Miloslav> 2.5" 15k drives.
>
> Can you post the output of "cat /proc/mdstat" or since you say you're
> using btrfs, are you using their own RAID0 setup?  If so, please post
> the output of 'btrfs stats' or whatever the command is you use to view
> layout info?

There is one PCIe RAID controller in the chassis, an AVAGO MegaRAID SAS
9361-8i, with 16x SAS 15k drives connected to it. Because the controller
does not support pass-through for the drives, we use 16x RAID-0 on the
controller. So we get /dev/sda ... /dev/sdp (roughly) in the OS, and on
top of that we have a single btrfs RAID-10, composed of 16 devices,
mounted as /data.

We have chosen this layout for several reasons:
- easy to increase capacity
- easy to replace drives with larger ones
- due to checksumming, btrfs does not need fsck in case of a power failure
- btrfs scrub discovers a failing drive sooner than S.M.A.R.T. or the RAID controller

> Miloslav> The server serves as IMAP with Dovecot 2.2.27-3+deb9u6, 4104 accounts,
> Miloslav> Maildir mailbox format, LMTP delivery.
>
> How often are these accounts hitting the server?

The IMAP server serves a university, so there are typical rush hours
from 7AM to 3PM. Usage lowers during the evening, and the server is
almost unused during the night.

> Miloslav> We run 'rsync' to a remote NAS daily. It takes about 6.5 hours to finish,
> Miloslav> 12'265'387 files last night.
>
> That's.... sucky.  So basically you're hitting the drives hard with
> random IOPs and you're probably running out of performance.  How much
> space are you using on the filesystem?

It's not as bad as it seems. rsync runs during the night, and even when
reading is high, the server load stays low. We have problems with writes.

> And why not use btrfs send to ship off snapshots instead of using
> rsync?  I'm sure that would be an improvement...

We run the backup to an external NAS (NetApp) for a disaster recovery
scenario. Moreover, the NAS is spread across multiple locations. Then we
create NAS snapshots, going ten days back. All snapshots are easily
available via an NFS mount. And NAS capacity is cheaper.

> Miloslav> Over the last half year, we have run into performance
> Miloslav> troubles. Server load grows up to 30 in rush hours, due to
> Miloslav> IO waits. We tried to attach more hard drives (the 838G ones
> Miloslav> in the list below) and increase free space by rebalancing. I
> Miloslav> think it helped a little bit, but not so dramatically.
>
> If you're IOPs bound, but not space bound, then you *really* want to
> get an SSD in there for the indexes and such.  Basically the stuff
> that gets written/read from all the time no matter what, but which
> isn't large in terms of space.

Yes. We are now at 66% capacity. Adding an SSD for the indexes is our
next step.

> Also, adding in another controller card or two would also probably
> help spread the load across more PCI channels, and reduce contention
> on the SATA/SAS bus as well.

Probably we will wait and see how the SSD helps first, but as you wrote,
it is a possible next step.

> Miloslav> Is this a reasonable setup and use case for btrfs RAID-10?
> Miloslav> If so, are there some recommendations to achieve better
> Miloslav> performance?
>
> 1. move HOT data to SSD based volume RAID 1 pair. On a separate
>    controller.

OK

> 2. add more controllers, which also means you're more redundant in
>    case one controller fails.

OK

> 3. Clone the system and put Dovecot IMAP director in front of the
>    setup.

I still hope that one server can handle 4105 accounts.

> 4. Stop using rsync for copying to your DR site, use the btrfs snap
>    send, or whatever the commands are.

I hope it is not needed in our scenario.

> 5. check which dovecot backend you're using and think about moving to
>    one which doesn't involve nearly as many files.

Maildir is comfortable for us. From time to time, users call us with
"I accidentally deleted the folder", and it is super easy to copy it
back from the backup.

> 6. Find out who your biggest users are, in terms of emails and move
>    them to SSDs if step 1 is too hard to do at first.

OK

> Can you also grab some 'iostat -dhm 30 60' output, which is 30
> minutes of data over 30 second intervals?  That should help you narrow
> down which (if any) disk is your hotspot.

OK, thanks for the tip.

> It's not clear to me if you have one big btrfs filesystem, or a bunch
> of smaller ones stitched together.  In any case, it should be very easy
> to get better performance here.

I hope I've made it clear above.

> I think someone else mentioned that you should look at your dovecot
> backend, and you should move to the fastest one you can find.
>
> Good luck!
> John

Thank you for your time and advice!
Kind regards
Milo

> Miloslav> # megaclisas-status
> Miloslav> -- Controller information --
> Miloslav> -- ID | H/W Model                  | RAM    | Temp | BBU  | Firmware
> Miloslav> c0    | AVAGO MegaRAID SAS 9361-8i | 1024MB | 72C  | Good | FW: 24.16.0-0082
>
> Miloslav> -- Array information --
> Miloslav> -- ID | Type   | Size | Strpsz | Flags | DskCache | Status  | OS Path  | CacheCade | InProgress
> Miloslav> c0u0  | RAID-0 | 838G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdq | None      | None
> Miloslav> c0u1  | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sda | None      | None
> Miloslav> c0u2  | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdb | None      | None
> Miloslav> c0u3  | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdc | None      | None
> Miloslav> c0u4  | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdd | None      | None
> Miloslav> c0u5  | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sde | None      | None
> Miloslav> c0u6  | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdf | None      | None
> Miloslav> c0u7  | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdg | None      | None
> Miloslav> c0u8  | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdh | None      | None
> Miloslav> c0u9  | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdi | None      | None
> Miloslav> c0u10 | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdj | None      | None
> Miloslav> c0u11 | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdk | None      | None
> Miloslav> c0u12 | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdl | None      | None
> Miloslav> c0u13 | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdm | None      | None
> Miloslav> c0u14 | RAID-0 | 558G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdn | None      | None
> Miloslav> c0u15 | RAID-0 | 838G | 256 KB | RA,WB | Enabled  | Optimal | /dev/sdr | None      | None
>
> Miloslav> -- Disk information --
> Miloslav> -- ID   | Type | Drive Model                       | Size     | Status          | Speed    | Temp | Slot ID | LSI ID
> Miloslav> c0u0p0  | HDD  | SEAGATE ST900MP0006 N003WAG0Q3S3  | 837.8 Gb | Online, Spun Up | 12.0Gb/s | 53C  | [8:14]  | 32
> Miloslav> c0u1p0  | HDD  | HGST HUC156060CSS200 A3800XV250TJ | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 38C  | [8:0]   | 12
> Miloslav> c0u2p0  | HDD  | HGST HUC156060CSS200 A3800XV3XT4J | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 43C  | [8:1]   | 11
> Miloslav> c0u3p0  | HDD  | HGST HUC156060CSS200 ADB05ZG4XLZU | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 46C  | [8:2]   | 25
> Miloslav> c0u4p0  | HDD  | HGST HUC156060CSS200 A3800XV3DWRL | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 48C  | [8:3]   | 14
> Miloslav> c0u5p0  | HDD  | HGST HUC156060CSS200 A3800XV3XZTL | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 52C  | [8:4]   | 18
> Miloslav> c0u6p0  | HDD  | HGST HUC156060CSS200 A3800XV3VSKJ | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 55C  | [8:5]   | 15
> Miloslav> c0u7p0  | HDD  | SEAGATE ST600MP0006 N003WAF1LWKE  | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 56C  | [8:6]   | 28
> Miloslav> c0u8p0  | HDD  | HGST HUC156060CSS200 A3800XV3XTDJ | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 55C  | [8:7]   | 20
> Miloslav> c0u9p0  | HDD  | HGST HUC156060CSS200 A3800XV3T8XL | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 57C  | [8:8]   | 19
> Miloslav> c0u10p0 | HDD  | HGST HUC156060CSS200 A7030XHL0ZYP | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 61C  | [8:9]   | 23
> Miloslav> c0u11p0 | HDD  | HGST HUC156060CSS200 ADB05ZG4VR3P | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 60C  | [8:10]  | 24
> Miloslav> c0u12p0 | HDD  | SEAGATE ST600MP0006 N003WAF195KA  | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 60C  | [8:11]  | 29
> Miloslav> c0u13p0 | HDD  | SEAGATE ST600MP0006 N003WAF1LTZW  | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 56C  | [8:12]  | 26
> Miloslav> c0u14p0 | HDD  | SEAGATE ST600MP0006 N003WAF1LWH6  | 558.4 Gb | Online, Spun Up | 12.0Gb/s | 55C  | [8:13]  | 27
> Miloslav> c0u15p0 | HDD  | SEAGATE ST900MP0006 N003WAG0Q414  | 837.8 Gb | Online, Spun Up | 12.0Gb/s | 47C  | [8:15]  | 33
>
> Miloslav> # btrfs --version
> Miloslav> btrfs-progs v4.7.3
>
> Miloslav> # btrfs fi show
> Miloslav> Label: 'DATA'  uuid: 5b285a46-e55d-4191-924f-0884fa06edd8
> Miloslav>         Total devices 16 FS bytes used 3.49TiB
> Miloslav>         devid  1 size 558.41GiB used 448.66GiB path /dev/sda
> Miloslav>         devid  2 size 558.41GiB used 448.66GiB path /dev/sdb
> Miloslav>         devid  4 size 558.41GiB used 448.66GiB path /dev/sdd
> Miloslav>         devid  5 size 558.41GiB used 448.66GiB path /dev/sde
> Miloslav>         devid  7 size 558.41GiB used 448.66GiB path /dev/sdg
> Miloslav>         devid  8 size 558.41GiB used 448.66GiB path /dev/sdh
> Miloslav>         devid  9 size 558.41GiB used 448.66GiB path /dev/sdf
> Miloslav>         devid 10 size 558.41GiB used 448.66GiB path /dev/sdi
> Miloslav>         devid 11 size 558.41GiB used 448.66GiB path /dev/sdj
> Miloslav>         devid 13 size 558.41GiB used 448.66GiB path /dev/sdk
> Miloslav>         devid 14 size 558.41GiB used 448.66GiB path /dev/sdc
> Miloslav>         devid 15 size 558.41GiB used 448.66GiB path /dev/sdl
> Miloslav>         devid 16 size 558.41GiB used 448.66GiB path /dev/sdm
> Miloslav>         devid 17 size 558.41GiB used 448.66GiB path /dev/sdn
> Miloslav>         devid 18 size 837.84GiB used 448.66GiB path /dev/sdr
> Miloslav>         devid 19 size 837.84GiB used 448.66GiB path /dev/sdq
>
> Miloslav> # btrfs fi df /data/
> Miloslav> Data, RAID10: total=3.48TiB, used=3.47TiB
> Miloslav> System, RAID10: total=256.00MiB, used=320.00KiB
> Miloslav> Metadata, RAID10: total=21.00GiB, used=18.17GiB
> Miloslav> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Miloslav> I do not attach the whole dmesg log. It is almost empty, without errors.
> Miloslav> The only lines about BTRFS are about relocations, like:
>
> Miloslav> BTRFS info (device sda): relocating block group 29435663220736 flags 65
> Miloslav> BTRFS info (device sda): found 54460 extents
> Miloslav> BTRFS info (device sda): found 54459 extents
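A rough sketch of how the weekly scrub and the early failure detection described above can be wired together; the wrapper script and the grep filter are illustrative assumptions, only the two btrfs commands themselves come from the thread:

  #!/bin/sh
  # Weekly cron job (Sunday): foreground scrub of the whole filesystem.
  # -B waits for completion, -d reports per-device statistics.
  btrfs scrub start -B -d /data

  # Per-device error counters; a drive whose read/write/corruption
  # counters start climbing is a replacement candidate even while
  # S.M.A.R.T. and the RAID controller still report it as healthy.
  btrfs device stats /data | grep -v ' 0$'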
The 9361-8i does support passthrough (JBOD mode). Make sure you have the latest firmware.
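If the controller were switched to JBOD mode as suggested, the usual tool is storcli. The commands below are only a sketch: the controller index /c0 and the enclosure/slot numbers are assumptions, and re-exposing a disk as JBOD destroys the single-drive RAID-0 volume currently on it, so it would only make sense one drive at a time with a btrfs rebuild afterwards. Check the controller documentation before running anything like this.

  # show adapter status (including firmware) and whether JBOD is enabled
  storcli64 /c0 show
  storcli64 /c0 show all | grep -i jbod

  # enable the JBOD personality on the adapter
  storcli64 /c0 set jbod=on

  # convert one physical drive (enclosure 8, slot 0 here) to JBOD;
  # destructive for the RAID-0 volume on that drive
  storcli64 /c0/e8/s0 set jbod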
>>>>> "Miloslav" == Miloslav Hůla <miloslav.hula at gmail.com> writes:

Miloslav> Hi, thank you for your reply. I'll continue inline...

Me too... please look for further comments, especially about 'fio' and
Netapp usage.

Miloslav> There is one PCIe RAID controller in the chassis, an AVAGO
Miloslav> MegaRAID SAS 9361-8i, with 16x SAS 15k drives connected to
Miloslav> it. Because the controller does not support pass-through for
Miloslav> the drives, we use 16x RAID-0 on the controller. So we get
Miloslav> /dev/sda ... /dev/sdp (roughly) in the OS, and on top of that
Miloslav> we have a single btrfs RAID-10, composed of 16 devices,
Miloslav> mounted as /data.

I will bet that this is one of your bottlenecks as well. Get a second
or third controller and split your disks across them evenly.

Miloslav> The IMAP server serves a university, so there are typical rush
Miloslav> hours from 7AM to 3PM. Usage lowers during the evening, and the
Miloslav> server is almost unused during the night.

I can understand this, I used to work at a Uni so I can understand the
population needs.

Miloslav> It's not as bad as it seems. rsync runs during the night, and
Miloslav> even when reading is high, the server load stays low. We have
Miloslav> problems with writes.

Ok. So putting in an SSD pair to cache things should help.

>> And why not use btrfs send to ship off snapshots instead of using
>> rsync?  I'm sure that would be an improvement...

Miloslav> We run the backup to an external NAS (NetApp) for a disaster
Miloslav> recovery scenario. Moreover, the NAS is spread across multiple
Miloslav> locations. Then we create NAS snapshots, going ten days back.
Miloslav> All snapshots are easily available via an NFS mount. And NAS
Miloslav> capacity is cheaper.

So why not run the backend storage on the Netapp, and just keep the
indexes and such local to the system? I've run Netapps for many years
and they work really well. And then you'd get automatic backups using
scheduled snapshots.

Keep the index files local on disk/SSDs and put the maildirs out to
NFSv3 volume(s) on the Netapp(s). Should do wonders. And you'll stop
needing to do rsync at night.

>> If you're IOPs bound, but not space bound, then you *really* want to
>> get an SSD in there for the indexes and such.  Basically the stuff
>> that gets written/read from all the time no matter what, but which
>> isn't large in terms of space.

Miloslav> Yes. We are now at 66% capacity. Adding an SSD for the indexes
Miloslav> is our next step.

This *should* give you a boost in performance. But finding a way to
take before and after latency/performance measurements is key. I
would look into using 'fio' to test your latency numbers. You might
also want to try using XFS or even ext4 as your filesystem. I
understand not wanting to 'fsck', so that might be right out. Which
leads me back to suggesting you use the Netapp as your primary
storage, assuming the Netapp isn't bogged down with other work.

Again, use 'fio' to run some tests and see how things look.
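For the before/after measurement suggested above, one possible fio invocation that approximates maildir-style small synced writes; the parameters and the test directory are illustrative assumptions, not a tuned benchmark:

  # create /data/fio-test first; the job does 4k random writes with an
  # fsync every 8 writes, roughly what mail delivery looks like
  fio --name=maildir-sim --directory=/data/fio-test \
      --rw=randwrite --bs=4k --size=1G --numjobs=4 \
      --ioengine=libaio --direct=1 --fsync=8 \
      --runtime=120 --time_based --group_reporting

The completion-latency ("clat") percentiles in the output are the numbers worth comparing before and after the SSD or controller changes.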
Some controllers have a direct "pass through to OS" option for a drive; that's what I meant. I can't recall why we chose RAID-0 instead of JBOD, there was some reason, but I hope there is no difference from a single drive.

Thank you
Milo

On 09.09.2020 at 15:51, Scott Q. wrote:
> The 9361-8i does support passthrough (JBOD mode). Make sure you have
> the latest firmware.
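Even while the disks are presented as single-drive RAID-0 volumes rather than JBOD, smartmontools can usually query the physical drives through the MegaRAID controller. A sketch, assuming smartctl is installed; the ",N" is the LSI ID column from the megaclisas-status output above (e.g. 12 for the drive behind /dev/sda):

  # S.M.A.R.T. / SAS health data of the physical disk behind /dev/sda
  smartctl -d megaraid,12 -a /dev/sda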
On 09.09.2020 at 17:52, John Stoffel wrote:
> Miloslav> There is one PCIe RAID controller in the chassis, an AVAGO
> Miloslav> MegaRAID SAS 9361-8i, with 16x SAS 15k drives connected to
> Miloslav> it. Because the controller does not support pass-through for
> Miloslav> the drives, we use 16x RAID-0 on the controller. So we get
> Miloslav> /dev/sda ... /dev/sdp (roughly) in the OS, and on top of that
> Miloslav> we have a single btrfs RAID-10, composed of 16 devices,
> Miloslav> mounted as /data.
>
> I will bet that this is one of your bottlenecks as well. Get a second
> or third controller and split your disks across them evenly.

That's the plan for a next step.

> Miloslav> It's not as bad as it seems. rsync runs during the night, and
> Miloslav> even when reading is high, the server load stays low. We have
> Miloslav> problems with writes.
>
> Ok. So putting in an SSD pair to cache things should help.
>
>>> And why not use btrfs send to ship off snapshots instead of using
>>> rsync?  I'm sure that would be an improvement...
>
> Miloslav> We run the backup to an external NAS (NetApp) for a disaster
> Miloslav> recovery scenario. Moreover, the NAS is spread across multiple
> Miloslav> locations. Then we create NAS snapshots, going ten days back.
> Miloslav> All snapshots are easily available via an NFS mount. And NAS
> Miloslav> capacity is cheaper.
>
> So why not run the backend storage on the Netapp, and just keep the
> indexes and such local to the system? I've run Netapps for many years
> and they work really well. And then you'd get automatic backups using
> scheduled snapshots.
>
> Keep the index files local on disk/SSDs and put the maildirs out to
> NFSv3 volume(s) on the Netapp(s). Should do wonders. And you'll stop
> needing to do rsync at night.

It's an option we have in mind. As you wrote, NetApp is very solid. The
main reason for local storage is that the IMAP server is completely
isolated from the network. But maybe one day we will use it.

> Miloslav> Yes. We are now at 66% capacity. Adding an SSD for the indexes
> Miloslav> is our next step.
>
> This *should* give you a boost in performance. But finding a way to
> take before and after latency/performance measurements is key. I
> would look into using 'fio' to test your latency numbers. You might
> also want to try using XFS or even ext4 as your filesystem. I
> understand not wanting to 'fsck', so that might be right out.

Unfortunately, to quickly fix the problem and make the server usable
again, we already added an SSD and moved the indexes onto it. So we have
no measurements of the old state. The situation is better, but I guess
the problem still exists; it takes some time for the load to grow. We
will see.

Thank you for the fio tip. I'll definitely try that.

Kind regards
Milo
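For reference, moving the indexes off the data array is a one-line change in the Dovecot 2.2 mail location. A sketch only: the maildir path and the SSD mount point below are assumptions, not the actual configuration from this thread.

  # conf.d/10-mail.conf
  # Maildirs stay where they are; the frequently rewritten index files
  # go to a per-user directory on the SSD.
  mail_location = maildir:~/Maildir:INDEX=/ssd/dovecot/index/%u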