RayLicon
2010-Jan-27 19:42 UTC
[zfs-code] Performance of partition based SWAP vs. ZFS zvol SWAP
Has anyone done research into the performance of swap on a traditional partition-based swap device as compared to a swap area set up on ZFS with a zvol? I can find no best practices for this issue. In the old days it was considered important to separate the swap devices onto individual disks (controllers) and select the outer cylinder groups for the partition (to gain some read speed). How does this compare to creating a single swap zvol within a rootpool and then mirroring the rootpool across two separate disks?
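For reference, here is a minimal sketch of the zvol-based setup being asked about, as it would typically be done from the shell. The 4 GB size, the 8k volblocksize, and the disk names are illustrative assumptions, not recommendations from this thread:

    # Create a zvol to use as swap; matching volblocksize to the system
    # page size is a common suggestion (8k here is an assumption; check
    # pagesize(1) on the target machine).
    zfs create -V 4G -b 8k rpool/swap

    # Add it as a swap device and confirm it is in use.
    swap -a /dev/zvol/dsk/rpool/swap
    swap -l

    # Mirror the root pool onto a second disk so the swap zvol inherits
    # the redundancy (disk names are hypothetical).
    zpool attach rpool c0t0d0s0 c0t1d0s0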
Richard Elling
2010-Jan-27 20:08 UTC
[zfs-code] Performance of partition based SWAP vs. ZFS zvol SWAP
On Jan 27, 2010, at 11:42 AM, RayLicon wrote:

> Has anyone done research into the performance of swap on a traditional partition-based swap device as compared to a swap area set up on ZFS with a zvol?

If you have to use the swap device, you have no performance. Case closed.

 -- richard
Jason Ozolins
2010-Feb-17 03:16 UTC
[zfs-code] Performance of partition based SWAP vs. ZFS zvol SWAP
> On Jan 27, 2010, at 11:42 AM, RayLicon wrote:
>
> > Has anyone done research into the performance of swap on a traditional partition-based swap device as compared to a swap area set up on ZFS with a zvol?
>
> If you have to use the swap device, you have no performance. Case closed.

Umm, I think that very much depends on the use case. I know one where high swap rates make a big difference, but it is kind of specific to HPC clusters such as the 12,000 core cluster that Sun is installing downstairs from where I work. See http://nf.nci.org.au for more on the machine.

On this machine, and on the previous 1920 core Altix 3000 cluster, heavy swapping is an integral part of the compute job preemption strategy. Large parallel compute jobs (certainly up to hundreds of CPUs; I am not sure of the largest jobs people are running) can be scheduled on nodes running small jobs by pre-empting the small jobs (sending SIGSTOP to the job process group) and letting the new job push the pre-empted jobs out to swap. The jobs scheduled and actually running on each node are always constrained to the amount of physical memory, but there will also be large amounts (multiple GB) of swap actively used to hold scheduled but pre-empted jobs. The high swap rates achievable with Linux are a big part of why this strategy works well. Once the large parallel job finishes, the small jobs get restarted and have to page themselves back in.

The key distinction is between pathological swapping, where the sum of the process working sets exceeds physical memory but all of those processes are still getting the CPU and causing page faults, and the job preemption case, where heavy swapping clears out pages that are no longer part of any working set because the processes being swapped out are suspended. The latter is much more like "ye olde schoole" UNIX whole-process swapping. Pagein rates are also a big deal when those suspended processes are sent a SIGCONT to start them running again.

You might say this is a rare use case, but it's something that the cluster admin always mentions as a real deal-breaker whenever a gentle Solaris vs. Linux discussion starts up...

-Jason
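To make the preemption mechanic concrete, here is a minimal sketch of the stop/resume step described above, using plain kill(1) against a job's process group. The $PGID variable and the surrounding scheduling logic are assumptions for illustration, not how the actual cluster scheduler is implemented:

    # Pre-empt a running job by stopping its entire process group
    # ($PGID is a placeholder for the job's process-group id).
    kill -STOP -- -$PGID

    # While the job is stopped, the newly scheduled job's memory demand
    # pushes the stopped job's pages out to swap.

    # Later, resume the pre-empted job; it pages itself back in on demand.
    kill -CONT -- -$PGID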