Chris Siebenmann
2008-Apr-30 17:53 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
I have a test system with 132 (small) ZFS pools[*], as part of our work to
validate a new ZFS-based fileserver environment. In testing, it appears that
we can produce situations that will run the kernel out of memory, or at least
out of some resource such that things start complaining
'bash: fork: Resource temporarily unavailable'. Sometimes the system locks
up solid.

I've found at least two situations that reliably do this:
 - trying to 'zpool scrub' each pool in sequence (waiting for each scrub
   to complete before starting the next one).
 - starting simultaneous sequential read IO from all pools from an NFS client.
   (Trying to do the same IO from the server basically kills the server
   entirely.)

If I aggregate the same disk space into 12 pools instead of 132, the same IO
load does not kill the system.

The ZFS machine is an X2100 M2 with 2GB of physical memory and 1GB of swap,
running 64-bit Solaris 10 U4 with an almost current set of patches; it gets
its storage from another machine via iSCSI. The pools are non-redundant,
with each vdev being a whole iSCSI LUN.

Is this a known issue (or issues)? If it isn't a known issue, does anyone
have pointers to good tools for tracking down what might be happening and
where memory is disappearing? Does the system simply need more memory for
this number of pools and, if so, does anyone know how much?

Thanks in advance.

(I was pointed at mdb -k's '::kmastat' by some people on the OpenSolaris IRC
channel, but I haven't spotted anything particularly enlightening in its
output, and I can't run it once the system has gone over the edge.)

- cks
[*: we have an outstanding uncertainty over how many ZFS pools a single
    system can sensibly support, so testing something larger than we'd use
    in production seemed sensible.]
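[A minimal sketch of the kind of sequential scrub loop described above --
not Chris's actual commands; the pool-name listing and the exact
"scrub in progress" status text are assumptions that may vary by Solaris
release.]

    #!/bin/sh
    # Scrub each pool in turn, waiting for each scrub to finish
    # before starting the next one.
    for pool in `zpool list -H -o name`; do
            zpool scrub "$pool"
            # Poll until 'zpool status' no longer reports an active scrub.
            while zpool status "$pool" | grep "scrub in progress" > /dev/null; do
                    sleep 30
            done
    done

For the memory question, the ::kmastat dcmd mentioned above is normally run
as 'echo ::kmastat | mdb -k' and reports per-cache kernel memory usage.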
Bill Moore
2008-Apr-30 18:48 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
A silly question: Why are you using 132 ZFS pools as opposed to a single
ZFS pool with 132 ZFS filesystems?

--Bill

On Wed, Apr 30, 2008 at 01:53:32PM -0400, Chris Siebenmann wrote:
> I have a test system with 132 (small) ZFS pools[*], as part of our
> work to validate a new ZFS-based fileserver environment. [...]
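[For illustration, the layout Bill is suggesting would look something like
the following; the pool name, device names, and group names are made up.]

    # One pool built from all of the iSCSI LUNs (non-redundant, as in
    # Chris's setup; device names are hypothetical)...
    zpool create tank c2t1d0 c2t2d0 c2t3d0

    # ...and then one filesystem per group instead of one pool per group.
    zfs create tank/groupA
    zfs create tank/groupB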
Jeff Bonwick
2008-Apr-30 21:18 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Indeed, things should be simpler with fewer (generally one) pool.

That said, I suspect I know the reason for the particular problem you're
seeing: we currently do a bit too much vdev-level caching. Each vdev can
have up to 10MB of cache. With 132 pools, even if each pool is just a
single iSCSI device, that's 1.32GB of cache.

We need to fix this, obviously. In the interim, you might try setting
zfs_vdev_cache_size to some smaller value, like 1MB.

Still, I'm curious -- why lots of pools? Administration would be simpler
with a single pool containing many filesystems.

Jeff

On Wed, Apr 30, 2008 at 11:48:07AM -0700, Bill Moore wrote:
> A silly question: Why are you using 132 ZFS pools as opposed to a
> single ZFS pool with 132 ZFS filesystems?
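[A sketch of how such a ZFS tunable is commonly set on Solaris 10; the 1MB
value is the one Jeff suggests above, but verify the tunable name and the
appropriate size against your own release before relying on this.]

    # Persistent setting: add this line to /etc/system and reboot.
    #   set zfs:zfs_vdev_cache_size = 0x100000    (1MB per vdev)

    # Or adjust the running kernel with mdb -kw (the 0t prefix means decimal):
    echo "zfs_vdev_cache_size/W 0t1048576" | mdb -kw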
Chris Siebenmann
2008-Apr-30 21:42 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
| Still, I'm curious -- why lots of pools? Administration would be
| simpler with a single pool containing many filesystems.

The short answer is that it is politically and administratively easier to
use (at least) one pool per storage-buying group in our environment. This
got discussed in more detail in the 'How many ZFS pools is it sensible to
use on a single server' zfs-discuss thread I started earlier this month[*].

(Trying to answer the question myself is the reason I wound up setting up
132 pools on my test system and discovering this issue.)

- cks
[*: http://opensolaris.org/jive/thread.jspa?threadID=56802]
Darren J Moffat
2008-May-01 09:08 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Chris Siebenmann wrote:
> The short answer is that it is politically and administratively easier
> to use (at least) one pool per storage-buying group in our environment.

I think the root cause of the issue is that multiple groups are buying
physical rather than virtual storage, yet it is all being attached to a
single system. It will likely be a huge uphill battle, but if all the
physical storage could be purchased by one group, and a combination of ZFS
reservations and quotas were used on "top level" datasets (i.e. one level
down from the pool) to allocate the virtual storage, with appropriate
amounts charged to the groups, you would technically be able to use ZFS as
it was intended, with far fewer (hopefully 1 or 2) pools.

-- 
Darren J Moffat
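[A sketch of the arrangement Darren describes: one top-level dataset per
storage-buying group, sized to what that group has paid for. Pool name,
group names, and sizes are made up.]

    # Each group gets a top-level dataset carved out of a single pool.
    zfs create tank/groupA
    zfs set quota=500G tank/groupA          # cap groupA at the space it bought
    zfs set reservation=500G tank/groupA    # and guarantee that space to it

    zfs create tank/groupB
    zfs set quota=250G tank/groupB
    zfs set reservation=250G tank/groupB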
David Collier-Brown
2008-May-01 13:00 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Darren J Moffat <darrenm at opensolaris.org> wrote:
> I think the root cause of the issue is that multiple groups are buying
> physical rather than virtual storage, yet it is all being attached to a
> single system. [...] you would technically be able to use ZFS as it was
> intended, with far fewer (hopefully 1 or 2) pools.

The scenario Chris describes is one I see repeatedly at customers buying
SAN storage (as late as last month!) and is considered a best practice on
the business side.

We may want to make this issue and its management visible, as people moving
from SAN to ZFS are likely to trip over it.

In particular, I'd like to see a blueprint or at least a wiki discussion by
someone from the SAN world on how to map those kinds of purchases to ZFS
pools, how few one wants to have, what happens when it goes wrong, and how
to mitigate it (;-))

--dave

ps: as always, having asked for something, I'm also volunteering to help
provide it: I'm not a storage or ZFS guy, but I am an author, and will
happily help my Smarter Colleagues[tm] to write it up.

-- 
David Collier-Brown  |  Always do right. This will gratify
Sun Microsystems, Toronto  |  some people and astonish the rest
davecb at sun.com  |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377
Chris Siebenmann
2008-May-01 15:02 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
| I think the root cause of the issue is that multiple groups are buying
| physical rather than virtual storage yet it is all being attached to a
| single system.

They're actually buying constant-sized chunks of virtual storage, which is
provided through a pool of SAN-based disk space. This means that we're
always going to have a certain number of logical pools of storage space to
manage that are expanded in fixed-size chunks; the question is whether to
manage them as separate ZFS pools or to aggregate them into fewer ZFS pools
and then use quotas on sub-hierarchies.

(With local storage you wouldn't have much choice; the physical disk size
is not likely to map nicely into the constant-sized chunks you sell to
people. With SAN storage you can pretty much make the 'disks' that Solaris
sees map straight to the chunk size.)

- cks
Richard Elling
2008-May-01 15:19 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
David Collier-Brown wrote:
> In particular, I'd like to see a blueprint or at least a wiki discussion
> by someone from the SAN world on how to map those kinds of purchases to
> ZFS pools, how few one wants to have, what happens when it goes wrong,
> and how to mitigate it (;-))

There are two issues here. One is the number of pools, but the other is the
small amount of RAM in the server. To be honest, most laptops today come
with 2 GBytes, and most servers are in the 8-16 GByte range (hmmm... I
suppose I could look up the average size we sell...)

> ps: as always, having asked for something, I'm also volunteering to
> help provide it: I'm not a storage or ZFS guy, but I am an author,
> and will happily help my Smarter Colleagues[tm] to write it up.

Should be relatively straightforward... we would need some help from
someone in this situation to provide some sort of performance results
(on a system with plenty of RAM).
 -- richard
Chris Siebenmann
2008-May-01 15:29 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
| There are two issues here. One is the number of pools, but the other
| is the small amount of RAM in the server. To be honest, most laptops
| today come with 2 GBytes, and most servers are in the 8-16 GByte range
| (hmmm... I suppose I could look up the average size we sell...)

Speaking as a sysadmin (and a Sun customer), why on earth would I have to
provision 8 GB+ of RAM on my NFS fileservers? I would much rather have that
memory in the NFS client machines, where it can actually be put to work by
user programs.

(If I have decently provisioned NFS client machines, I don't expect much
from the NFS fileserver's cache. Given that the clients have caches too, I
believe that the server's cache will mostly be hit for things that the
clients cannot cache because of NFS semantics, like NFS GETATTR requests
for revalidation and the like.)

- cks
Bart Smaalders
2008-May-01 15:59 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Chris Siebenmann wrote:
> Speaking as a sysadmin (and a Sun customer), why on earth would I have
> to provision 8 GB+ of RAM on my NFS fileservers? I would much rather
> have that memory in the NFS client machines, where it can actually be
> put to work by user programs.

This depends entirely on the amount of disk & CPU on the fileserver...
A Thumper w/ 48 TB of disk and two dual-core CPUs is probably somewhat
under-provisioned w/ 8 GB of RAM.

- Bart

-- 
Bart Smaalders  |  Solaris Kernel Performance
barts at cyber.eng.sun.com  |  http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
David Collier-Brown
2008-May-01 16:10 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Chris Siebenmann <cks at cs.toronto.edu> wrote:
| Speaking as a sysadmin (and a Sun customer), why on earth would I have
| to provision 8 GB+ of RAM on my NFS fileservers? I would much rather
| have that memory in the NFS client machines, where it can actually be
| put to work by user programs.
|
| (If I have decently provisioned NFS client machines, I don't expect much
| from the NFS fileserver's cache. Given that the clients have caches too,
| I believe that the server's cache will mostly be hit for things that the
| clients cannot cache because of NFS semantics, like NFS GETATTR requests
| for revalidation and the like.)

That's certainly true for the NFS part of the NFS fileserver, but to get
the ZFS feature set, you trade off cycles and memory. If we investigate
this a bit, we should be able to figure out a rule of thumb for how little
memory we need for an NFS->home-directories workload without cutting into
performance.

--dave

-- 
David Collier-Brown, Sun Microsystems, Toronto (davecb at sun.com)
Richard Elling
2008-May-01 17:14 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Bart Smaalders wrote:
> Chris Siebenmann wrote:
>> Speaking as a sysadmin (and a Sun customer), why on earth would I have
>> to provision 8 GB+ of RAM on my NFS fileservers? I would much rather
>> have that memory in the NFS client machines, where it can actually be
>> put to work by user programs.

Chris, thanks for the comment. We will have to be very specific here.
There is a minimum expected RAM cost per pool of 10 MBytes, so it is more
a function of the number of pools than of the services rendered. I can see
that crafting the message clearly will take some time.

> This depends entirely on the amount of disk & CPU on the fileserver...
> A Thumper w/ 48 TB of disk and two dual-core CPUs is probably somewhat
> under-provisioned w/ 8 GB of RAM.

Except the smallest Thumper we sell has 16 GBytes... :-)
 -- richard
Chris Siebenmann
2008-May-05 15:23 UTC
[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
[Jeff Bonwick:]
| That said, I suspect I know the reason for the particular problem
| you're seeing: we currently do a bit too much vdev-level caching.
| Each vdev can have up to 10MB of cache. With 132 pools, even if
| each pool is just a single iSCSI device, that's 1.32GB of cache.
|
| We need to fix this, obviously. In the interim, you might try
| setting zfs_vdev_cache_size to some smaller value, like 1MB.

I wanted to update the mailing list with a success story: I added another
2GB of memory to the server (bringing it to 4GB total), tried my 132-pool
tests again, and things worked fine. So this seems to have been the issue,
and I'm calling it fixed now.

(I decided that adding some more memory to the server was simpler in the
long run than setting system parameters.)

I can still make the Solaris system lock up solidly if I do extreme things,
like running 'zpool scrub <pool> &' for all 132 pools at once, but I'm not
too surprised by that; you can always kill a system if you try hard enough.
The important thing for me is that routine things don't kill the system any
more just because it has so many pools.

So: thank you, everyone.

- cks