Until now things were fine, but today, while doing a regular LVM snapshot backup, I noticed a huge lag in I/O.

Whenever I ran dd on dom0, the load on the Linux domUs increased to high values.

A simple read/write test on dom0 showed a speed of about 1.2 MB/s:

# dd if=/dev/zero of=./test bs=1k count=1048576
^C^C^C^C
545259520 bytes (520 MB) copied, 402.32934 seconds, 1.3 MB/s

Running Xen 4.2.1.

Has anyone noticed anything similar?
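P.S. For what it's worth, the write test above goes through the page cache, so the number can swing either way. A variant that forces the data to disk before reporting (GNU dd; the file name and sizes are just examples) would be:

# dd if=/dev/zero of=./test bs=1M count=512 conv=fdatasync
# rm ./test

conv=fdatasync makes dd call fdatasync() at the end, so the reported rate includes actually flushing the writes to disk.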
To add to my last email:

This happens to be related to the LVM snapshot. Every time a snapshot is created for the LV that a particular domU sits on, the load on that domU spikes up to 30 and things become sluggish.

ionice and the usual LVM parameter tweaks didn't help!

To the real guys out there: how do you use LVM snapshots with Xen dom0, if at all? To me, it seems like LVM snapshotting isn't a workable short-term backup strategy at all! (A rough sketch of the kind of flow I mean is below.)

On Tue, Jul 9, 2013 at 1:52 AM, Micky <mickylmartin@gmail.com> wrote:
> Until now, things were fine but today while doing a regular lvm
> snapshot backup, I noticed a huge lag in I/O.
>
> Whenever i ran dd on dom0, the load on Linux domUs increased to high values.
>
> A simple read/write test on Dom0 showed the speed of 1.2 MB/s.
>
> # dd if=/dev/zero of=./test bs=1k count=1048576
> ^C^C^C^C
> 545259520 bytes (520 MB) copied, 402.32934 seconds, 1.3 MB/s
>
> Running Xen 4.2.1.
>
> Has anyone noticed anything similar?
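For reference, the per-domU flow is roughly the usual create / dump / remove sequence; the VG/LV names and snapshot size here are placeholders, not my actual values:

# snapshot the LV backing the domU while it keeps running
lvcreate -s -L 10G -n vm1-snap /dev/vg0/vm1

# dump the snapshot to a backup image
dd if=/dev/vg0/vm1-snap of=/backup/vm1.img bs=1M

# drop the snapshot as soon as the dump finishes
lvremove -f /dev/vg0/vm1-snap

It's during the lifetime of that snapshot, and especially during the dd, that the load on the domU goes through the roof.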
On 09/07/13 13:15, Micky wrote:
> To add to my last email.
> This happens to be related to LVM snapshot. Every time a snapshot is
> created for the LVM where a particular domu is on, load on that domu
> spikes up to 30 and things become sluggish.
>
> Ionice and lvm parameter formalities didn't help!
>
> To the real guys out there:
> How do you use LVM snapshots with Xen dom0, if any? To me, it seems
> like LVM snapshotting isn't a short-term backup strategy at all!

I've had similar issues. In fact, for the life of the LVM snapshot, performance seems to degrade severely. Usually a single snapshot is OK, but I wanted to have three snapshots, and each day delete the oldest and create a new one.

I've found two "solutions":

1) Make your storage backend perform like a god, so that even after you take the snapshots performance is like a stroll down the road. (I.e., I've upgraded to SSD-based storage which can get approx 1.5TB/s write and 2.5TB/s read.)

2) Only keep a single snapshot, and if possible remove it as soon as your backup is completed... and/or keep writes to a minimum while the snapshot is active.

My plan is to do something like this (a rough sketch of steps 5-7 is in the P.S. below):

1) Have two storage backend machines
2) Use DRBD to sync the two of them (primary sits on a RAID device, secondary sits on LVM on a RAID device)
3) Use LVM on top of the DRBD to create LVs for each domU
5) Take a snapshot using the underlying LVM (below DRBD) on the secondary
6) Run your backup processes on the snapshot of the DRBD
7) Delete the snapshot

The problem I have is that steps 6 and 7 probably involve disconnecting the backup server from the primary (breaking the DRBD), promoting it to primary, and making various changes to it (i.e., intentionally creating a split-brain scenario). After finishing the backup process, you may need to invalidate the entire DRBD and re-sync, which could be too time consuming (and itself cause a performance issue).

I haven't yet got that far in the process, so if you do something similar it would be helpful to hear about it. Also, anything other people can share about what they do and what works well / doesn't work would be nice to see.

Finally, the other problem I have with LVM on Debian (stable) is that every week or two it will freeze on lvremove, and other lvs or LV-related commands will then freeze as well. The only solution seems to be a reboot. (Using kernel 3.2.0-4-686-pae #1 SMP Debian 3.2.41-2 i686.) I haven't tracked this down or reported it yet, but it is frustrating to have to reboot the dom0 so often.

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au
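P.S. A very rough sketch of steps 5-7 as I picture them, with made-up device and VG names, and glossing over the promotion / split-brain question above:

# on the secondary: snapshot the LV that sits underneath the DRBD device
lvcreate -s -L 20G -n drbd0-snap /dev/vg0/drbd0-backing

# run the backup from the snapshot, e.g. a raw compressed dump
dd if=/dev/vg0/drbd0-snap bs=1M | gzip > /backup/domU-set.img.gz

# drop the snapshot when the backup is done
lvremove -f /dev/vg0/drbd0-snap

Whether the snapshot can simply be read like this, or whether the DRBD has to be broken and the secondary promoted first, is exactly the part I haven't worked out yet.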
> I've had similar issues, in fact, for the life of the LVM snapshot,
> performance seems to severely degrade. Usually a single snapshot is ok, but
> I wanted to have three snapshots, and each day delete the oldest and create
> a new one.

What a coincidence! I am doing exactly the same.

> I've found two "solutions":
> 1) Make your storage backend perform like a god so that after you take the
> snapshots performance is like a stroll down the road. (ie, I've upgraded to
> SSD based storage which can get approx 1.5TB/s write and 2.5TB/s read) ....
> 2) Only keep a single snapshot, and if possible, remove it as soon as your
> backup is completed.... and/or keep writes to a minimum while the snapshot
> is active.

That's exactly what the script I wrote does. Check http://github.com/bassu/xen-scripts/

As for SSDs, I haven't found them stable enough for long-term production environments!

> My plan is to do something like this:
> 1) Have two storage backend machines
> 2) Use DRBD to sync the two of them (primary sits on RAID device, secondary
> sits on LVM on RAID device)
> 3) Use LVM on top of the DRBD to create LV's for each domU
> 5) Take a snapshot using the underlying LVM (below DRBD) on the secondary
> 6) Run your backup processes on the snapshot of the DRBD
> 7) Delete the snapshot

Sounds rather complicated. Block-level snapshots underneath stacked block-level devices seems like a lot of overhead! Gluster may be more useful in this case, but that's just a guess.

> I haven't yet got that far in the process, so if you do something it would
> be helpful to hear about it.
>
> Also any other people who can share what they do and what works well/doesn't
> work would be nice to see.

I am experimenting with a few tricks. I will share the outcome, like the script I just shared :)

> Finally, the other problem I have with LVM on Debian (stable) is that every
> week or two, it will freeze on lvremove, and other lvs or LV related
> commands will freeze. The only solution seems to be a reboot. (Using kernel
> 3.2.0-4-686-pae #1 SMP Debian 3.2.41-2 i686). I haven't tracked this down or
> reported it yet, but it is frustrating to have to reboot the dom0 so often.

LVM is slow as heck when it comes to snapshots. And everywhere I look, people talk about the "copy on write" magic, but no one tells you that you are going to bite your tongue!

> Regards,

Cheers.
On 09/07/13 18:49, Micky wrote:
>> I've found two "solutions":
>> 1) Make your storage backend perform like a god so that after you take the
>> snapshots performance is like a stroll down the road. (ie, I've upgraded to
>> SSD based storage which can get approx 1.5TB/s write and 2.5TB/s read) ....
>> 2) Only keep a single snapshot, and if possible, remove it as soon as your
>> backup is completed.... and/or keep writes to a minimum while the snapshot
>> is active.
> That's what the script I wrote is doing. Check
> http://github.com/bassu/xen-scripts/

I had a quick read through your script... it looks pretty nice and complete, just a couple of comments (a small sketch of points 1 and 4 is at the end of this mail):

1) On line 159 you do a killall -9 dd, but you know the PID of the dd that you launched, so you might accidentally kill another dd process run from some other script etc.; consider changing it to kill -9 $ddpid.

2) In find_lvm you call lvdisplay, and this is where I tend to have the same problem (various lvm2 processes hang forever, including lvs, and lvremove when removing snapshots). I don't know a good way to solve that except rebooting when it happens.

3) You set the snapshot chunk size to 512k; what does this do, does it really make much difference?

4) You are reading the full snapshot, writing out a full uncompressed copy of the image, then reading the copy back and writing out the compressed copy. You could optimise this by reading the snapshot and writing compressed data directly, in one step. If the CPU is faster than the disk, this will reduce the overall backup time, and might also reduce the time the snapshot hangs around.

5) I found that if the LV is on the same disk I am saving the dump to, things slow down drastically (reading and writing the same disk in different locations at the same time), so back up to a different disk if possible.

My script is currently much simpler: I just create the snapshots and remove the old ones (no full copies of the snapshots etc.). I use BackupPC, which I've got working for one system, to snapshot the VM, mount the image, back up with rsync, then umount and remove the snapshot. I still like to keep a full image snapshot, and even better to send that raw image offsite. In another scenario I shut down the VM (which uses an image file), simply copy the file via some tools into chunks of 100M, then start the VM up again.

> As for SSDs, I didn't find them stable as in long-term production environments!

Interesting. I've had problems with a number of SSDs, but since I started using the Intel 520s I've not had any issues. I have one environment with about 10 heavily used Windows domUs; the SAN is using 5 x 480G SSDs, and so far there haven't been any issues (I think over 12 months now). It would be interesting to hear if you have any additional information/comments?

>> My plan is to do something like this:
>> 1) Have two storage backend machines
>> 2) Use DRBD to sync the two of them (primary sits on RAID device, secondary
>> sits on LVM on RAID device)
>> 3) Use LVM on top of the DRBD to create LV's for each domU
>> 5) Take a snapshot using the underlying LVM (below DRBD) on the secondary
>> 6) Run your backup processes on the snapshot of the DRBD
>> 7) Delete the snapshot
> Sounds a lot complicated. Block level snapshots under grouped block
> level devices -- seems like a lot of overhead!
> Gluster may be a lot more useful in this case -- just a slight guess.

In my opinion, Gluster will add a lot of overhead anyway, maybe isn't sufficiently stable, and I certainly don't know it well enough to put it into production. LVM + MD + DRBD, on the other hand, are all simple, low overhead, and well understood. Each read/write with LVM/MD/DRBD is simply remapped to a physical device read/write, while GlusterFS is more of a filesystem, with more overhead and complexity.

>> I haven't yet got that far in the process, so if you do something it would
>> be helpful to hear about it.
>>
>> Also any other people who can share what they do and what works well/doesn't
>> work would be nice to see.
> I am experimenting with a few tricks. I will share the outcome like
> the script I just shared :)

Thanks, appreciated.

>> Finally, the other problem I have with LVM on Debian (stable) is that every
>> week or two, it will freeze on lvremove, and other lvs or LV related
>> commands will freeze. The only solution seems to be a reboot. (Using kernel
>> 3.2.0-4-686-pae #1 SMP Debian 3.2.41-2 i686). I haven't tracked this down or
>> reported it yet, but it is frustrating to have to reboot the dom0 so often.
> LVM is slow as heck when it comes to snapshots. And everywhere I look,
> people talk about the "copy on write" magic,
> but no one tells you that you are gonna bite your tongue!

If biting my tongue would help, I'd do it :)

Running multiple VMs on a single storage device, especially spinning disks, seems to be challenging; it's hard to ensure the right performance with all the contention etc. Using SSDs should make things a lot simpler/easier, but LVM performance is making that really difficult, and I still don't understand why performance is so horrible. At some point I'll join the LVM list and investigate in more detail, but I've got "good enough" performance so far, and have other higher-priority issues on my list...

Thanks again.

Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au
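P.S. What I meant for points 1 and 4 above, roughly; the LV and output paths are made-up examples, not taken from your script:

# point 4: read the snapshot once and compress on the fly, no intermediate copy
dd if=/dev/vg0/vm1-snap bs=1M | gzip > /backup/vm1.img.gz

# point 1: if dd is launched in the background and may need to be aborted,
# remember its PID and kill only that process, not every dd on the box
dd if=/dev/vg0/vm1-snap of=/backup/vm1.img bs=1M &
ddpid=$!
# ... later, if the backup has to be aborted:
kill -9 "$ddpid"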
First off, thanks for checking.

Secondly, I have managed to resolve the disk dumping issues with the LVM snapshots, and the preliminary tests are satisfactory. It turns out the default CFQ scheduler was not suited to this workload:

Dom0: echo deadline > /sys/block/sda/queue/scheduler
DomU: echo noop > /sys/block/xvda/queue/scheduler

If you want the reasons, let me know and I'll explain the findings further. Since I am using a megaraid controller, I looked at the LSI recommendations and tweaked the kernel further. Overall this gave me a 50% performance boost on cheap Seagate disks. No more sluggishness!!

About the script:

1) Good catch. That was indeed the purpose of creating $ddpid. Seems like a typo.

2) We use RHEL/CentOS in production, so I have never had that issue and didn't consider it. But you could do something like the following to catch lvdisplay when it has been running for more than 5 minutes (a slightly fuller sketch is at the end of this mail):

[[ $(ps -p $(pidof lvdisplay) -o etimes:1=) -gt 300 ]] && echo "lvdisplay appears stuck"

3) My tests at the time showed that a 512k snapshot chunk size gave more speed to the dd writes. But now that I have switched to the deadline scheduler, I get the best results without passing -c to lvcreate at all and dd'ing with bs=100M. Also, there's no need for ionice, since that only works with CFQ.

4) It takes the same amount of CPU time though. Dumping and compressing large chunks at the same time through pipes and stdout can cause weird issues with FIFOs. IMHO, why risk a corrupt backup when the only real way in the world to test backups is by restoring them! A little certainty that you don't have a dirty backup is worth a little more I/O expense!

5) Affirmative. That is why two separate config variables exist there: BACKUP_DIR and PROCESS_DIR.

> My script is currently much simpler, I simply create the snapshots and
> remove the old ones (no full copies of the snapshots/etc).

Seems fine. In my case there are more than a few nodes and tens of domains, so the above works pretty well for me as a short-term backup strategy!

> I use backuppc which I've got working for one system to snapshot the VM,
> mount the image, backup with rsync, then umount and remove the snapshot. I
> still like to keep a full image snapshot, and even better to send that raw
> image offsite.

I use Burp from inside the domU.

> It would be interesting to hear if you have any additional information/comments?

Well, I started with a few small machines, and one after another the SSDs died on me, either due to firmware problems or bad blocks. I tried Crucial, switched to Intel and then Samsung. The latter were the ones that ran fine for the longest time. Now I just use them in personal laptops.

> Another scenario I shutdown the VM (using an image file), then simply copy
> the file via some tools into chunks of 100M, then startup the VM.

Seems fine from an administration point of view, but people have become uptime conscious these days.

> In my opinion, gluster will add a lot of overhead anyway, and maybe is not
> sufficiently stable, and certainly I don't know it well enough to put into
> production. While LVM + MD + DRBD are all simple, low overhead, well
> understood, etc... Each read/write with LVM/MD/DRBD is simply a remap
> process to a physical device read/write, while glusterfs seems more of a
> filesystem with more overhead/complexity.

I haven't played much with DRBD, so these are only guesses. My understanding of network-based domain I/O is that unless you have high-speed disks, high-speed network equipment, or preferably a SAN, the domains will suffer from I/O latency if there are more than a few.
Simply put, gigabit switches and so-called 6Gb/s SAS drives aren't sufficient.

> Running multiple VM's on a single storage device, especially spinning disks,
> seems to be challenging to ensure the right performance with all the
> contention/etc... Using SSD's should be a lot simpler/easier, but LVM
> performance is making that really difficult, and I still don't understand
> why performance is so horrible. At some point, I'll join the LVM list and
> investigate in more detail, but I've got "good enough" performance so far,
> and have other higher priority issues on my list...

So true. Try the workaround I mentioned above of switching the scheduler to noop or deadline, and see if you find any improvement.

> Thanks again.

Quite welcome!
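P.S. A slightly fuller sketch of that hung-LVM watchdog idea from point 2, e.g. run from cron on the dom0. The 300-second threshold, the list of commands to watch, and the logging are just examples:

#!/bin/bash
# Log a warning if any LVM command has been running for more than 5 minutes.
for name in lvdisplay lvremove lvs; do
    for pid in $(pidof "$name"); do
        elapsed=$(ps -p "$pid" -o etimes= | tr -d ' ')
        if [ "${elapsed:-0}" -gt 300 ]; then
            logger -t lvm-watchdog "$name (pid $pid) has been running for ${elapsed}s"
        fi
    done
done

Note that etimes needs a reasonably recent procps; on an older ps you would have to parse the formatted etime output instead.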
Adam:

P.S. https://github.com/bassu/xen-scripts/commit/20294000bee25fa986adfe284fc3d0c2aa11965f

On Wed, Jul 10, 2013 at 11:07 AM, Micky <mickylmartin@gmail.com> wrote:
> First off, thanks for checking.
> Secondly, I have managed to resolve the disk dumping issues from LVM
> snapshots and preliminary tests are satisfactory.
>
> Turns out, the default scheduler CFQ was not suited for this workload.
> Dom0: echo deadline > /sys/block/sda/queue/scheduler
> DomU: echo noop > /sys/block/xvda/queue/scheduler
>
> If you need reasons, let me know and I'll explain the findings further.
>
> Since I am using megaraid controller, I looked at LSI recommendations
> and tweaked kernel further.
> This overall gave me 50% performance boost on cheap Seagate disks.
>
> No more sluggishness!!
>
> About the script:
>
> 1) Good catch. That was indeed the purpose of creating $ddpid. Seems
> like a typo.
>
> 2) We use RHEL/CentOS in production so I have never had such an issue
> so didn't consider. But you could do something like:
> [[ $(ps -p $(pidof lvdisplay) -o etimes:1=) -gt 300 ]] do something if
> it executes for more than 5 mins
>
> 3) My tests at time showed 512k snapshot chunk size gave more speed to
> dd writes. But now after I have switched to deadline scheduler, there
> are best results without specifying -c parameter to lvm and dd'ing
> with bs=100M. Also, there's no need for ionice since it works with
> CFQ only.
>
> 4) It takes the same amount of CPU time though. Dumping and
> compressing large chunks at the same time with pipes and stdouts can
> cause weird issues with FIFOs. IMHO, why risk taking a chance of
> having corrupt backups when the only real way in the world to test the
> backups is by restoring them! A little certainty of knowing of not
> having a dirty backup is worth little more of I/O expense!
>
> 5) Affirmative. That is why two separate config variables exist there:
> BACKUP_DIR and PROCESS_DIR
>
>> My script is currently much simpler, I simply create the snapshots and
>> remove the old ones (no full copies of the snapshots/etc).
>
> Seems fine. In my case there are more than few nodes and tens of
> domains. So the above works pretty well for me as short term backup
> strategy!
>
>> I use backuppc which I've got working for one system to snapshot the VM,
>> mount the image, backup with rsync, then umount and remove the snapshot. I
>> still like to keep a full image snapshot, and even better to send that raw
>> image offsite.
>
> I use Burp from inside the domu.
>
>> It would be interesting to hear if you have any additional information/comments?
>
> Well, I started with few small machines and one after another SSDs
> died on me either due to firmware problems or bad blocks. I tried
> Crucial, switched to Intel and then Samsung. The latter were ones that
> ran fine for the longest time. Now I just use these for personal
> laptops.
>
>> Another scenario I shutdown the VM (using an image file), then simply copy
>> the file via some tools into chunks of 100M, then startup the VM.
>
> Seems fine from administration point of view but people have become
> uptime conscious these days.
>
>> In my opinion, gluster will add a lot of overhead anyway, and maybe is not
>> sufficiently stable, and certainly I don't know it well enough to put into
>> production. While LVM + MD + DRBD are all simple, low overhead, well
>> understood, etc...
>> Each read/write with LVM/MD/DRBD is simply a remap
>> process to a physical device read/write, while glusterfs seems more of a
>> filesystem with more overhead/complexity.
>
> And I haven't played much with DRBD so there are only guesses. My
> understanding with network based domains' I/O is that unless you have
> high speed disks or network equipment or preferably a SAN, the domains
> will suffer from I/O latency if there are more than a few. Simply the
> gigabit switches and so called 6Gb/s SAS drives aren't sufficient.
>
>> Running multiple VM's on a single storage device, especially spinning disks,
>> seems to be challenging to ensure the right performance with all the
>> contention/etc... Using SSD's should be a lot simpler/easier, but LVM
>> performance is making that really difficult, and I still don't understand
>> why performance is so horrible. At some point, I'll join the LVM list and
>> investigate in more detail, but I've got "good enough" performance so far,
>> and have other higher priority issues on my list...
>
> So true. Try the workaround I mentioned above of switching the
> scheduler to noop or deadline, and see if you find any improvements.
>
>> Thanks again.
> Quite welcome!