RayLicon
2010-Jan-27 19:42 UTC
[zfs-code] Performance of partition based SWAP vs. ZFS zvol SWAP
Has anyone done research into the performance of swap on a traditional partition-based swap device as compared to a swap area set up on ZFS with a zvol? I can find no best practices for this issue. In the old days it was considered important to separate the swap devices onto individual disks (controllers) and select the outer cylinder groups for the partition (to gain some read speed). How does this compare to creating a single swap zvol within a rootpool and then mirroring the rootpool across two separate disks?
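For reference, here is a minimal sketch of the zvol-based setup being asked about, as it would typically be done from the shell. The 4 GB size, the 8k volblocksize, and the disk names are illustrative assumptions, not recommendations from this thread:

    # Create a zvol to use as swap; matching volblocksize to the system
    # page size is a common suggestion (8k here is an assumption; check
    # pagesize(1) on the target machine).
    zfs create -V 4G -b 8k rpool/swap

    # Add it as a swap device and confirm it is in use.
    swap -a /dev/zvol/dsk/rpool/swap
    swap -l

    # Mirror the root pool onto a second disk so the swap zvol inherits
    # the redundancy (disk names are hypothetical).
    zpool attach rpool c0t0d0s0 c0t1d0s0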
Richard Elling
2010-Jan-27 20:08 UTC
[zfs-code] Performance of partition based SWAP vs. ZFS zvol SWAP
On Jan 27, 2010, at 11:42 AM, RayLicon wrote:

> Has anyone done research into the performance of swap on a traditional partition-based swap device as compared to a swap area set up on ZFS with a zvol?

If you have to use the swap device, you have no performance. Case closed.

 -- richard
Jason Ozolins
2010-Feb-17 03:16 UTC
[zfs-code] Performance of partition based SWAP vs. ZFS zvol SWAP
> On Jan 27, 2010, at 11:42 AM, RayLicon wrote:
>
> > Has anyone done research into the performance of swap on a traditional partition-based swap device as compared to a swap area set up on ZFS with a zvol?
>
> If you have to use the swap device, you have no performance. Case closed.

Umm, I think that very much depends on the use case. I know one where high swap rates make a big difference, but it is kind of specific to HPC clusters such as the 12,000 core cluster that Sun is installing downstairs from where I work. See http://nf.nci.org.au for more on the machine.

On this machine, and on the previous 1920 core Altix 3000 cluster, heavy swapping is an integral part of the compute job preemption strategy. Large parallel compute jobs (certainly up to hundreds of CPUs; I am not sure of the largest jobs people are running) can be scheduled on nodes running small jobs by pre-empting the small jobs (sending SIGSTOP to the job process group) and letting the new job push the pre-empted jobs out to swap. The jobs scheduled and actually running on each node are always constrained to the amount of physical memory, but there will also be large amounts (multiple GB) of swap actively used to hold scheduled but pre-empted jobs. The high swap rates achievable with Linux are a big part of why this strategy works well. Once the large parallel job finishes, the small jobs get restarted and have to page themselves back in.

The key distinction is between pathological swapping, where the sum of the process working sets exceeds physical memory but all of those processes are still getting the CPU and causing page faults, and the job preemption case, where heavy swapping clears out pages that are no longer part of any working set because the processes being swapped out are suspended. The latter is much more like "ye olde schoole" UNIX whole-process swapping. Pagein rates are also a big deal when those suspended processes are sent a SIGCONT to start them running again.

You might say this is a rare use case, but it's something that the cluster admin always mentions as a real deal-breaker whenever a gentle Solaris vs. Linux discussion starts up...

-Jason
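To make the preemption mechanic concrete, here is a minimal sketch of the stop/resume step described above, using plain kill(1) against a job's process group. The $PGID variable and the surrounding scheduling logic are assumptions for illustration, not how the actual cluster scheduler is implemented:

    # Pre-empt a running job by stopping its entire process group
    # ($PGID is a placeholder for the job's process-group id).
    kill -STOP -- -$PGID

    # While the job is stopped, the newly scheduled job's memory demand
    # pushes the stopped job's pages out to swap.

    # Later, resume the pre-empted job; it pages itself back in on demand.
    kill -CONT -- -$PGID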