Chris Siebenmann
2008-Apr-30 17:53 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
I have a test system with 132 (small) ZFS pools[*], as part of our work to
validate a new ZFS-based fileserver environment. In testing, it appears that
we can produce situations that will run the kernel out of memory, or at least
out of some resource such that things start complaining
'bash: fork: Resource temporarily unavailable'. Sometimes the system locks
up solid.

I've found at least two situations that reliably do this:
 - trying to 'zpool scrub' each pool in sequence (waiting for each scrub
   to complete before starting the next one).
 - starting simultaneous sequential read IO from all pools from an NFS client.
   (Trying to do the same IO from the server basically kills the server
   entirely.)

If I aggregate the same disk space into 12 pools instead of 132, the same IO
load does not kill the system.

The ZFS machine is an X2100 M2 with 2GB of physical memory and 1GB of swap,
running 64-bit Solaris 10 U4 with an almost current set of patches; it gets
its storage from another machine via iSCSI. The pools are non-redundant,
with each vdev being a whole iSCSI LUN.

Is this a known issue (or issues)? If it isn't a known issue, does anyone
have pointers to good tools for tracking down what might be happening and
where memory is disappearing? Does the system simply need more memory for
this number of pools and, if so, does anyone know how much?

Thanks in advance.

(I was pointed at mdb -k's '::kmastat' by some people on the OpenSolaris IRC
channel, but I haven't spotted anything particularly enlightening in its
output, and I can't run it once the system has gone over the edge.)

- cks
[*: we have an outstanding uncertainty over how many ZFS pools a single
    system can sensibly support, so testing something larger than we'd use
    in production seemed sensible.]
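[A minimal sketch of the kind of sequential scrub loop described above --
not Chris's actual commands; the pool-name listing and the exact
"scrub in progress" status text are assumptions that may vary by Solaris
release.]

    #!/bin/sh
    # Scrub each pool in turn, waiting for each scrub to finish
    # before starting the next one.
    for pool in `zpool list -H -o name`; do
            zpool scrub "$pool"
            # Poll until 'zpool status' no longer reports an active scrub.
            while zpool status "$pool" | grep "scrub in progress" > /dev/null; do
                    sleep 30
            done
    done

For the memory question, the ::kmastat dcmd mentioned above is normally run
as 'echo ::kmastat | mdb -k' and reports per-cache kernel memory usage.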
Bill Moore
2008-Apr-30 18:48 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
A silly question: Why are you using 132 ZFS pools as opposed to a single
ZFS pool with 132 ZFS filesystems?

--Bill

On Wed, Apr 30, 2008 at 01:53:32PM -0400, Chris Siebenmann wrote:
> I have a test system with 132 (small) ZFS pools[*], as part of our
> work to validate a new ZFS-based fileserver environment. [...]
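[For illustration, the layout Bill is suggesting would look something like
the following; the pool name, device names, and group names are made up.]

    # One pool built from all of the iSCSI LUNs (non-redundant, as in
    # Chris's setup; device names are hypothetical)...
    zpool create tank c2t1d0 c2t2d0 c2t3d0

    # ...and then one filesystem per group instead of one pool per group.
    zfs create tank/groupA
    zfs create tank/groupB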
Jeff Bonwick
2008-Apr-30 21:18 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Indeed, things should be simpler with fewer (generally one) pool.

That said, I suspect I know the reason for the particular problem you're
seeing: we currently do a bit too much vdev-level caching. Each vdev can
have up to 10MB of cache. With 132 pools, even if each pool is just a
single iSCSI device, that's 1.32GB of cache.

We need to fix this, obviously. In the interim, you might try setting
zfs_vdev_cache_size to some smaller value, like 1MB.

Still, I'm curious -- why lots of pools? Administration would be simpler
with a single pool containing many filesystems.

Jeff

On Wed, Apr 30, 2008 at 11:48:07AM -0700, Bill Moore wrote:
> A silly question: Why are you using 132 ZFS pools as opposed to a
> single ZFS pool with 132 ZFS filesystems?
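[A sketch of how such a ZFS tunable is commonly set on Solaris 10; the 1MB
value is the one Jeff suggests above, but verify the tunable name and the
appropriate size against your own release before relying on this.]

    # Persistent setting: add this line to /etc/system and reboot.
    #   set zfs:zfs_vdev_cache_size = 0x100000    (1MB per vdev)

    # Or adjust the running kernel with mdb -kw (the 0t prefix means decimal):
    echo "zfs_vdev_cache_size/W 0t1048576" | mdb -kw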
Chris Siebenmann
2008-Apr-30 21:42 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
| Still, I'm curious -- why lots of pools? Administration would be
| simpler with a single pool containing many filesystems.

The short answer is that it is politically and administratively easier to
use (at least) one pool per storage-buying group in our environment. This
got discussed in more detail in the 'How many ZFS pools is it sensible to
use on a single server' zfs-discuss thread I started earlier this month[*].

(Trying to answer the question myself is the reason I wound up setting up
132 pools on my test system and discovering this issue.)

- cks
[*: http://opensolaris.org/jive/thread.jspa?threadID=56802]
Darren J Moffat
2008-May-01 09:08 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Chris Siebenmann wrote:
> The short answer is that it is politically and administratively easier
> to use (at least) one pool per storage-buying group in our environment.

I think the root cause of the issue is that multiple groups are buying
physical rather than virtual storage, yet it is all being attached to a
single system. It will likely be a huge uphill battle, but if all the
physical storage could be purchased by one group, and a combination of ZFS
reservations and quotas were used on "top level" datasets (i.e. one level
down from the pool) to allocate the virtual storage, with appropriate
amounts charged to the groups, you would technically be able to use ZFS as
it was intended, with far fewer (hopefully 1 or 2) pools.

-- 
Darren J Moffat
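[A sketch of the arrangement Darren describes: one top-level dataset per
storage-buying group, sized to what that group has paid for. Pool name,
group names, and sizes are made up.]

    # Each group gets a top-level dataset carved out of a single pool.
    zfs create tank/groupA
    zfs set quota=500G tank/groupA          # cap groupA at the space it bought
    zfs set reservation=500G tank/groupA    # and guarantee that space to it

    zfs create tank/groupB
    zfs set quota=250G tank/groupB
    zfs set reservation=250G tank/groupB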
David Collier-Brown
2008-May-01 13:00 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Darren J Moffat <darrenm at opensolaris.org> wrote:
> I think the root cause of the issue is that multiple groups are buying
> physical rather than virtual storage, yet it is all being attached to a
> single system. [...] you would technically be able to use ZFS as it was
> intended, with far fewer (hopefully 1 or 2) pools.

The scenario Chris describes is one I see repeatedly at customers buying
SAN storage (as late as last month!) and is considered a best practice on
the business side.

We may want to make this issue and its management visible, as people moving
from SAN to ZFS are likely to trip over it.

In particular, I'd like to see a blueprint or at least a wiki discussion by
someone from the SAN world on how to map those kinds of purchases to ZFS
pools, how few one wants to have, what happens when it goes wrong, and how
to mitigate it (;-))

--dave

ps: as always, having asked for something, I'm also volunteering to help
provide it: I'm not a storage or ZFS guy, but I am an author, and will
happily help my Smarter Colleagues[tm] to write it up.

-- 
David Collier-Brown  |  Always do right. This will gratify
Sun Microsystems, Toronto  |  some people and astonish the rest
davecb at sun.com  |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377
Chris Siebenmann
2008-May-01 15:02 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
| I think the root cause of the issue is that multiple groups are buying
| physical rather than virtual storage yet it is all being attached to a
| single system.

They're actually buying constant-sized chunks of virtual storage, which is
provided through a pool of SAN-based disk space. This means that we're
always going to have a certain number of logical pools of storage space to
manage that are expanded in fixed-size chunks; the question is whether to
manage them as separate ZFS pools or to aggregate them into fewer ZFS pools
and then use quotas on sub-hierarchies.

(With local storage you wouldn't have much choice; the physical disk size
is not likely to map nicely into the constant-sized chunks you sell to
people. With SAN storage you can pretty much make the 'disks' that Solaris
sees map straight to the chunk size.)

- cks
Richard Elling
2008-May-01 15:19 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
David Collier-Brown wrote:
> In particular, I'd like to see a blueprint or at least a wiki discussion
> by someone from the SAN world on how to map those kinds of purchases to
> ZFS pools, how few one wants to have, what happens when it goes wrong,
> and how to mitigate it (;-))

There are two issues here. One is the number of pools, but the other is the
small amount of RAM in the server. To be honest, most laptops today come
with 2 GBytes, and most servers are in the 8-16 GByte range (hmmm... I
suppose I could look up the average size we sell...)

> ps: as always, having asked for something, I'm also volunteering to
> help provide it: I'm not a storage or ZFS guy, but I am an author,
> and will happily help my Smarter Colleagues[tm] to write it up.

Should be relatively straightforward... we would need some help from
someone in this situation to provide some sort of performance results
(on a system with plenty of RAM).
 -- richard
Chris Siebenmann
2008-May-01 15:29 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
| There are two issues here. One is the number of pools, but the other
| is the small amount of RAM in the server. To be honest, most laptops
| today come with 2 GBytes, and most servers are in the 8-16 GByte range
| (hmmm... I suppose I could look up the average size we sell...)

Speaking as a sysadmin (and a Sun customer), why on earth would I have to
provision 8 GB+ of RAM on my NFS fileservers? I would much rather have that
memory in the NFS client machines, where it can actually be put to work by
user programs.

(If I have decently provisioned NFS client machines, I don't expect much
from the NFS fileserver's cache. Given that the clients have caches too, I
believe that the server's cache will mostly be hit for things that the
clients cannot cache because of NFS semantics, like NFS GETATTR requests
for revalidation and the like.)

- cks
Bart Smaalders
2008-May-01 15:59 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Chris Siebenmann wrote:
> Speaking as a sysadmin (and a Sun customer), why on earth would I have
> to provision 8 GB+ of RAM on my NFS fileservers? I would much rather
> have that memory in the NFS client machines, where it can actually be
> put to work by user programs.

This depends entirely on the amount of disk & CPU on the fileserver...
A Thumper w/ 48 TB of disk and two dual-core CPUs is probably somewhat
under-provisioned w/ 8 GB of RAM.

- Bart

-- 
Bart Smaalders  |  Solaris Kernel Performance
barts at cyber.eng.sun.com  |  http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
David Collier-Brown
2008-May-01 16:10 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Chris Siebenmann <cks at cs.toronto.edu> wrote:
| Speaking as a sysadmin (and a Sun customer), why on earth would I have
| to provision 8 GB+ of RAM on my NFS fileservers? I would much rather
| have that memory in the NFS client machines, where it can actually be
| put to work by user programs.
|
| (If I have decently provisioned NFS client machines, I don't expect much
| from the NFS fileserver's cache. Given that the clients have caches too,
| I believe that the server's cache will mostly be hit for things that the
| clients cannot cache because of NFS semantics, like NFS GETATTR requests
| for revalidation and the like.)

That's certainly true for the NFS part of the NFS fileserver, but to get
the ZFS feature set, you trade off cycles and memory. If we investigate
this a bit, we should be able to figure out a rule of thumb for how little
memory we need for an NFS->home-directories workload without cutting into
performance.

--dave

-- 
David Collier-Brown, Sun Microsystems, Toronto (davecb at sun.com)
Richard Elling
2008-May-01 17:14 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Bart Smaalders wrote:
> Chris Siebenmann wrote:
>> Speaking as a sysadmin (and a Sun customer), why on earth would I have
>> to provision 8 GB+ of RAM on my NFS fileservers? I would much rather
>> have that memory in the NFS client machines, where it can actually be
>> put to work by user programs.

Chris, thanks for the comment. We will have to be very specific here.
There is a minimum expected RAM cost per pool of 10 MBytes, so it is more
a function of the number of pools than of the services rendered. I can see
that crafting the message clearly will take some time.

> This depends entirely on the amount of disk & CPU on the fileserver...
> A Thumper w/ 48 TB of disk and two dual-core CPUs is probably somewhat
> under-provisioned w/ 8 GB of RAM.

Except the smallest Thumper we sell has 16 GBytes... :-)
 -- richard
Chris Siebenmann
2008-May-05 15:23 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
[Jeff Bonwick:]
| That said, I suspect I know the reason for the particular problem
| you're seeing: we currently do a bit too much vdev-level caching.
| Each vdev can have up to 10MB of cache. With 132 pools, even if
| each pool is just a single iSCSI device, that's 1.32GB of cache.
|
| We need to fix this, obviously. In the interim, you might try
| setting zfs_vdev_cache_size to some smaller value, like 1MB.

I wanted to update the mailing list with a success story: I added another
2GB of memory to the server (bringing it to 4GB total), tried my 132-pool
tests again, and things worked fine. So this seems to have been the issue,
and I'm calling it fixed now.

(I decided that adding some more memory to the server was simpler in the
long run than setting system parameters.)

I can still make the Solaris system lock up solidly if I do extreme things,
like running 'zpool scrub <pool> &' for all 132 pools at once, but I'm not
too surprised by that; you can always kill a system if you try hard enough.
The important thing for me is that routine things don't kill the system any
more just because it has so many pools.

So: thank you, everyone.

- cks