I performed a SPEC SFS97 benchmark on Solaris 10u2/SPARC with 4 x 64GB LUNs, connected via FC SAN.
The filesystems created on the LUNs were UFS, VxFS, and ZFS.
Unfortunately the ZFS test couldn't complete because the box hung under very moderate load (3000 IOPS).
Additional tests were done using UFS and VxFS built on ZFS raw devices (zvols).
Results can be seen here:
http://napobo3.blogspot.com/2006/08/spec-sfs-bencmark-of-zfsufsvxfs.html

-- Leon
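[For orientation, a minimal sketch of how LUNs like these might be dressed for each of the three cases, including UFS/VxFS on top of a zvol. Device names, pool names, and the zvol size are placeholders, not the exact setup used in this test.]

    # ZFS directly on a LUN
    zpool create pool1 c4t001738010140000Bd0

    # UFS directly on a LUN slice
    newfs /dev/rdsk/c4t001738010140000Cd0s0

    # VxFS directly on a LUN slice
    mkfs -F vxfs /dev/rdsk/c4t001738010140001Cd0s0

    # UFS (or VxFS) on a ZFS raw device: create a zvol, then newfs it
    zfs create -V 60g pool1/vol1
    newfs /dev/zvol/rdsk/pool1/vol1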
William D. Hathaway
2006-Aug-07 12:18 UTC
[zfs-discuss] Re: SPEC SFS97 benchmark of ZFS,UFS,VxFS
If this is reproducible, can you force a panic so it can be analyzed?
On 8/7/06, William D. Hathaway <william.hathaway at versatile.com> wrote:
> If this is reproducible, can you force a panic so it can be analyzed?

The core files and explorer output are here:
http://napobo3.lk.net/vinc/
The core files were created after the box was hung... break to OBP... sync
George Wilson
2006-Aug-07 15:10 UTC
[zfs-discuss] Re: SPEC SFS97 benchmark of ZFS,UFS,VxFS
Leon,

Looking at the corefile doesn't really show much from the ZFS side. It looks like you were having problems with your SAN though:

/scsi_vhci/ssd@g001738010080001c (ssd5) offline
/scsi_vhci/ssd@g001738010080001c (ssd5) multipath status: failed, path /pci@1d,700000/SUNW,emlxs@1/fp@0,0 (fp2) to target address: w1000001738043811,3 is offline Load balancing: none
/scsi_vhci/ssd@g001738010080001f (ssd6) offline
/scsi_vhci/ssd@g001738010080001f (ssd6) multipath status: failed, path /pci@1d,700000/SUNW,emlxs@1/fp@0,0 (fp2) to target address: w1000001738043811,2 is offline Load balancing: none
/scsi_vhci/ssd@g001738010080001e (ssd7) offline
/scsi_vhci/ssd@g001738010080001e (ssd7) multipath status: failed, path /pci@1d,700000/SUNW,emlxs@1/fp@0,0 (fp2) to target address: w1000001738043811,1 is offline Load balancing: none
WARNING: /scsi_vhci/ssd@g001738010080001a (ssd8): transport rejected fatal error
WARNING: fp(0)::GPN_ID for D_ID=10400 failed
WARNING: fp(0)::N_x Port with D_ID=10400, PWWN=1000001738279c10 disappeared from fabric
/pci@1d,700000/SUNW,emlxs@1,1/fp@0,0 (fcp0): Lun=0 for target=10400 disappeared
WARNING: /pci@1d,700000/SUNW,emlxs@1,1/fp@0,0 (fcp0): FCP: target=10400 reported NO Luns
WARNING: fp(0)::GPN_ID for D_ID=10400 failed
WARNING: fp(0)::N_x Port with D_ID=10400, PWWN=1000001738279c10 disappeared from fabric
/pci@1d,700000/SUNW,emlxs@1,1/fp@0,0 (fcp0): Lun=0 for target=10400 disappeared
WARNING: /pci@1d,700000/SUNW,emlxs@1,1/fp@0,0 (fcp0): FCP: target=10400 reported NO Luns
/pci@1d,700000/SUNW,emlxs@1,1/fp@0,0 (fcp0): Lun=0 for target=10400 disappeared
WARNING: /pci@1d,700000/SUNW,emlxs@1,1/fp@0,0 (fcp0): FCP: target=10400 reported NO Luns
/scsi_vhci/ssd@g001738010080001c (ssd5) multipath status: failed, path /pci@1d,700000/SUNW,emlxs@1/fp@0,0 (fp2) to target address: w1000001738043811,3 is offline Load balancing: none
/scsi_vhci/ssd@g001738010080001c (ssd5) offline
/scsi_vhci/ssd@g001738010080001f (ssd6) multipath status: failed, path /pci@1d,700000/SUNW,emlxs@1/fp@0,0 (fp2) to target address: w1000001738043811,2 is offline Load balancing: none
/scsi_vhci/ssd@g001738010080001f (ssd6) offline
/scsi_vhci/ssd@g001738010080001e (ssd7) multipath status: failed, path /pci@1d,700000/SUNW,emlxs@1/fp@0,0 (fp2) to target address: w1000001738043811,1 is offline Load balancing: none
/scsi_vhci/ssd@g001738010080001e (ssd7) offline
/scsi_vhci/ssd@g001738010080001a (ssd8) multipath status: failed, path /pci@1d,700000/SUNW,emlxs@1/fp@0,0 (fp2) to target address: w1000001738043811,0 is offline Load balancing: none

panic[cpu0]/thread=2a10057dcc0:
BAD TRAP: type=31 rp=2a10057cee0 addr=0 mmu_fsr=0 occurred in module "unix" due to a NULL pointer dereference

Can you reproduce this hang?

Thanks,
George

Leon Koll wrote:
> On 8/7/06, William D. Hathaway <william.hathaway at versatile.com> wrote:
>> If this is reproducible, can you force a panic so it can be analyzed?
>
> The core files and explorer output are here:
> http://napobo3.lk.net/vinc/
> The core files were created after the box was hung... break to OBP... sync
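[For anyone else picking apart a dump like this, a minimal sketch of pulling the same information out of a saved crash dump with mdb. It assumes savecore(1M) already wrote unix.0/vmcore.0 under /var/crash/<hostname>; adjust the numeric suffix to match your dump.]

    cd /var/crash/`hostname`
    mdb unix.0 vmcore.0
    ::status      # panic string and dump summary
    ::msgbuf      # kernel message buffer (the SAN/multipath warnings above)
    ::panicinfo   # trap type and registers at panic time
    ::stack       # stack of the panicking thread
    $q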
On 8/7/06, George Wilson <George.Wilson at sun.com> wrote:
> Looking at the corefile doesn't really show much from the ZFS side. It
> looks like you were having problems with your SAN though:
> <...>
> Can you reproduce this hang?

George,
Doing it now.

Thanks,
-- Leon
On Mon, Leon Koll wrote:
> I performed a SPEC SFS97 benchmark on Solaris 10u2/SPARC with 4 x 64GB
> LUNs, connected via FC SAN.
> <...>
> Results can be seen here:
> http://napobo3.blogspot.com/2006/08/spec-sfs-bencmark-of-zfsufsvxfs.html

Leon,

Might I suggest that you provide the details as specified in the SPEC SFS run and reporting rules? They can be buried in a link from your blog, but it would be helpful to have that information available to your readers.

Spencer
Leon Koll wrote:
> I performed a SPEC SFS97 benchmark on Solaris 10u2/SPARC with 4 x 64GB
> LUNs, connected via FC SAN.
> The filesystems created on the LUNs were UFS, VxFS, and ZFS.
> Unfortunately the ZFS test couldn't complete because the box hung
> under very moderate load (3000 IOPS).
> <...>

hiya leon,

Out of curiosity, how was the setup for each filesystem type done?

I wasn't sure what "4 ZFS'es" in "The bad news that the test on 4 ZFS'es couldn't run at all" meant... so something like 'zpool status' would be great.

Do you know what your limiting factor was for ZFS (CPU, memory, I/O...)?

eric
On 8/8/06, eric kustarz <eric.kustarz at sun.com> wrote:
> Out of curiosity, how was the setup for each filesystem type done?
>
> I wasn't sure what "4 ZFS'es" in "The bad news that the test on 4 ZFS'es
> couldn't run at all" meant... so something like 'zpool status' would be
> great.

Hi Eric,
here it is:

root@vinc ~ # zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        pool1                    ONLINE       0     0     0
          c4t001738010140000Bd0  ONLINE       0     0     0

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        pool2                    ONLINE       0     0     0
          c4t001738010140000Cd0  ONLINE       0     0     0

errors: No known data errors

  pool: pool3
 state: ONLINE
 scrub: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        pool3                    ONLINE       0     0     0
          c4t001738010140001Cd0  ONLINE       0     0     0

errors: No known data errors

  pool: pool4
 state: ONLINE
 scrub: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        pool4                    ONLINE       0     0     0
          c4t0017380101400012d0  ONLINE       0     0     0

errors: No known data errors

> Do you know what your limiting factor was for ZFS (CPU, memory, I/O...)?

Thanks to George Wilson, who pointed me to the fact that the memory was fully consumed.
I removed the line
set ncsize = 0x100000
from /etc/system, and now the host isn't hung during the test anymore.
But performance is still an issue.

-- Leon
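[For context, a quick way to see what the DNLC tunable and kernel memory are doing on a live Solaris 10 box. This is a minimal sketch; the commands are standard mdb/kstat usage, run them as root.]

    # value of ncsize as the running kernel sees it
    echo "ncsize/D" | mdb -k

    # DNLC activity counters (hits, misses, entries purged, ...)
    kstat -n dnlcstats

    # breakdown of kernel vs. other memory consumers, useful when RAM is exhausted
    echo "::memstat" | mdb -k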
Leon Koll wrote:
> Hi Eric,
> here it is:
> <...>
> errors: No known data errors

So having 4 pools isn't a recommended config - i would destroy those 4 pools and just create 1 RAID-0 pool:
#zpool create sfsrocks c4t001738010140000Bd0 c4t001738010140000Cd0 c4t001738010140001Cd0 c4t0017380101400012d0

each of those devices is a 64GB lun, right?

> Thanks to George Wilson, who pointed me to the fact that the memory was
> fully consumed.
> I removed the line
> set ncsize = 0x100000
> from /etc/system, and now the host isn't hung during the test anymore.
> But performance is still an issue.

ah, you were limiting the # of dnlc entries... so you're still seeing ZFS max out at 2000 ops/s? Let us know what happens when you switch to 1 pool.

eric
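[To keep the comparison close to the 4-filesystem UFS/VxFS runs, the single pool can still present several filesystems. A minimal sketch; the pool and filesystem names are placeholders, and destroying the old pools erases their contents.]

    # tear down the four single-LUN pools
    zpool destroy pool1
    zpool destroy pool2
    zpool destroy pool3
    zpool destroy pool4

    # one dynamically striped pool across all four LUNs
    zpool create sfsrocks c4t001738010140000Bd0 c4t001738010140000Cd0 \
        c4t001738010140001Cd0 c4t0017380101400012d0

    # several filesystems for the SFS load points, all sharing the pool's bandwidth
    zfs create sfsrocks/fs1
    zfs create sfsrocks/fs2
    zfs create sfsrocks/fs3
    zfs create sfsrocks/fs4
    zfs list -r sfsrocks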
<...>
> So having 4 pools isn't a recommended config - i would destroy those 4
> pools and just create 1 RAID-0 pool:
> #zpool create sfsrocks c4t001738010140000Bd0 c4t001738010140000Cd0
> c4t001738010140001Cd0 c4t0017380101400012d0
>
> each of those devices is a 64GB lun, right?

I did it - created one pool, 4*64GB in size, and I am running the benchmark now.
I'll update you on the results, but one pool is definitely not what I need.
My target is SunCluster with HA ZFS, where I need 2 or 4 pools per node.

> ah, you were limiting the # of dnlc entries... so you're still seeing
> ZFS max out at 2000 ops/s? Let us know what happens when you switch to
> 1 pool.

I'd say "increasing" instead of "limiting".

TIA,
-- Leon
Leon Koll wrote:
> I did it - created one pool, 4*64GB in size, and I am running the benchmark now.
> I'll update you on the results, but one pool is definitely not what I need.
> My target is SunCluster with HA ZFS, where I need 2 or 4 pools per node.

Why do you need 2 or 4 pools per node?

If you're doing HA-ZFS (which is SunCluster 3.2 - only available in beta right now), then you should divide your storage up according to the number of *active* pools. So say you have 2 nodes and 4 luns (each lun being 64GB), and only need one active node - then you can create one pool of all 4 luns, and attach the 4 luns to both nodes.

The way HA-ZFS basically works is that when the "active" node fails, it does a 'zpool export', and the takeover node does a 'zpool import'. So both nodes are using the same storage, but they cannot use the same storage at the same time, see:
http://www.opensolaris.org/jive/thread.jspa?messageID=49617

If however, you have 2 nodes, 4 luns, and wish both nodes to be active, then you can divvy up the storage into two pools. So each node has one active pool of 2 luns. All 4 luns are doubly attached to both nodes, and when one node fails, the takeover node then has 2 active pools.

So how many nodes do you have? And how many do you wish to be "active" at a time?

And what was your configuration for VxFS and SVM/UFS?

eric
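[A minimal sketch of the export/import handoff described above, done by hand outside the cluster framework. Pool and node names are placeholders; in a real HA-ZFS setup the agent drives these steps.]

    # on the active node (nodeA), release the pool
    zpool export sfsrocks

    # on the takeover node (nodeB), which sees the same 4 LUNs
    zpool import sfsrocks

    # if nodeA went down hard and never exported, the takeover node
    # has to force the import
    zpool import -f sfsrocks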
Hello eric,

Friday, August 11, 2006, 3:04:38 AM, you wrote:

ek> Why do you need 2 or 4 pools per node?
ek> <...>
ek> So how many nodes do you have? And how many do you wish to be "active"
ek> at a time?

With 2-node NFS clusters I normally have one node active and one standby. However, with many disks I always configure things so that I have the possibility to split the workload (pools, filesystems, ...). I do it by creating two cluster groups, each with its own IP, disks, etc. That way, if I have a performance problem related to server performance and not to the array itself, I can solve it quickly and temporarily. So I think it is good to create at least two ZFS pools and two SC groups, and normally set the primary node for both groups to the same node.

--
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
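[For illustration, a rough sketch of the two-group layout Robert describes, using scrgadm/scswitch-style commands from memory. The exact syntax and the HAStoragePlus Zpools extension property should be checked against the actual SC 3.2 release; group, resource, hostname, and pool names are placeholders.]

    # register the storage resource type once
    scrgadm -a -t SUNW.HAStoragePlus

    # group 1: its own logical host and its own ZFS pool
    scrgadm -a -g nfs-rg-1 -h node1,node2
    scrgadm -a -L -g nfs-rg-1 -l nfs-lh-1
    scrgadm -a -j hasp-rs-1 -g nfs-rg-1 -t SUNW.HAStoragePlus -x Zpools=pool1

    # group 2: second logical host and second pool
    scrgadm -a -g nfs-rg-2 -h node1,node2
    scrgadm -a -L -g nfs-rg-2 -l nfs-lh-2
    scrgadm -a -j hasp-rs-2 -g nfs-rg-2 -t SUNW.HAStoragePlus -x Zpools=pool2

    # bring both groups online; both start out mastered by the same node,
    # but either can be switched to the other node under load
    scswitch -Z -g nfs-rg-1
    scswitch -Z -g nfs-rg-2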
On 8/11/06, eric kustarz <eric.kustarz at sun.com> wrote:
> Why do you need 2 or 4 pools per node?
>
> If you're doing HA-ZFS (which is SunCluster 3.2 - only available in beta
> right now), then you should divide your storage up according to the number of

I know, I run the 3.2 now.

> *active* pools. So say you have 2 nodes and 4 luns (each lun being
> 64GB), and only need one active node - then you can create one pool of

To have one active node doesn't look smart to me. I want to distribute the load between 2 nodes, not have 1 active and 1 standby.
The LUN size in this test is 64GB, but in the real configuration it will be 6TB.

> all 4 luns, and attach the 4 luns to both nodes.
>
> The way HA-ZFS basically works is that when the "active" node fails, it
> does a 'zpool export', and the takeover node does a 'zpool import'. So
> both nodes are using the same storage, but they cannot use the same
> storage at the same time, see:
> http://www.opensolaris.org/jive/thread.jspa?messageID=49617

Yes, it works this way.

> If however, you have 2 nodes, 4 luns, and wish both nodes to be active,
> then you can divvy up the storage into two pools. So each node has one
> active pool of 2 luns. All 4 luns are doubly attached to both nodes,
> and when one node fails, the takeover node then has 2 active pools.

I agree with you - I can have 2 active pools, not 4, in the case of a dual-node cluster.

> So how many nodes do you have? And how many do you wish to be "active"
> at a time?

Currently - 2 nodes, both active. If I define 4 pools, I can easily expand the cluster to a 4-node configuration; that may be a good reason to have 4 pools.

> And what was your configuration for VxFS and SVM/UFS?

4 SVM concat volumes (I need a concatenation of 1TB LUNs since I am on SC3.1, which doesn't support EFI labels) with UFS or VxFS on top.

And now come the questions - my short test showed that the 1-pool config doesn't behave better than the 4-pool one: with the first, the box hung; with the second, it didn't.
Why do you think the 1-pool config is better?

TIA,
-- Leon
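[For reference, a rough sketch of one such SVM concat volume with UFS on top. The metadevice name and LUN slices are placeholders; Leon's real setup concatenates 1TB LUNs, and VxFS would use mkfs -F vxfs instead of newfs.]

    # concatenation of two LUN slices into metadevice d10
    # (2 stripes of 1 component each = a concat, not a stripe)
    metainit d10 2 1 c4t001738010140000Bd0s0 1 c4t001738010140000Cd0s0

    # put UFS on it and mount
    newfs /dev/md/rdsk/d10
    mount /dev/md/dsk/d10 /sfs/fs1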
Leon Koll wrote:
> Currently - 2 nodes, both active. If I define 4 pools, I can easily
> expand the cluster to a 4-node configuration; that may be a good
> reason to have 4 pools.

Ok, that makes sense.

> 4 SVM concat volumes (I need a concatenation of 1TB LUNs since I am on
> SC3.1, which doesn't support EFI labels) with UFS or VxFS on top.

So you have 2 nodes, with 2 file systems (of either UFS or VxFS) on each node?

I'm just trying to make sure it's a fair comparison between ZFS, UFS, and VxFS.

> And now come the questions - my short test showed that the 1-pool config
> doesn't behave better than the 4-pool one: with the first, the box hung;
> with the second, it didn't.
> Why do you think the 1-pool config is better?

I suggested the 1 pool config before i knew you were doing HA-ZFS :)
Purposely dividing up your storage (by creating separate pools) in a non-clustered environment usually doesn't make sense (root being one notable exception).

eric
On 8/11/06, eric kustarz <eric.kustarz at sun.com> wrote:
> So you have 2 nodes, with 2 file systems (of either UFS or VxFS) on each node?

I have 2 nodes, with 2 file systems per node. One share is served via bge0, the second one via bge1.

> I'm just trying to make sure it's a fair comparison between ZFS, UFS, and
> VxFS.

After I saw that ZFS performance (when the box isn't stuck) is about 3 times lower than UFS/VxFS, I understood that I should wait with ZFS until the official Solaris 11 release.
I don't believe that it's possible to do some magic with my setup and increase the ZFS performance 3 times. Correct me if I'm wrong.

> I suggested the 1 pool config before i knew you were doing HA-ZFS :)
> Purposely dividing up your storage (by creating separate pools) in a
> non-clustered environment usually doesn't make sense (root being one
> notable exception).

I see.

Thanks,
-- Leon
> After I saw that ZFS performance (when the box isn't stuck) is about 3
> times lower than UFS/VxFS, I understood that I should wait with ZFS until
> the official Solaris 11 release.
> I don't believe that it's possible to do some magic with my setup and
> increase the ZFS performance 3 times. Correct me if I'm wrong.

Yep, we're working on this right now, though you shouldn't have to wait until Solaris 11 - hopefully an s10 update will be out earlier with the proper perf fixes. U3 already has some improvements over U2 (which you were running).

I'm actually doing SPEC SFS benchmarking right now, and i'll keep the list updated.

eric
On August 10, 2006 6:04:38 PM -0700 eric kustarz <eric.kustarz at sun.com> wrote:
> If you're doing HA-ZFS (which is SunCluster 3.2 - only available in beta
> right now),

Is the 3.2 beta publicly available? I can only locate 3.1.

-frank
Frank,

The SC 3.2 beta may be closed, but I'm forwarding your request to Eric Redmond.

Thanks,
George

Frank Cusack wrote:
> Is the 3.2 beta publicly available? I can only locate 3.1.
>
> -frank
George Wilson wrote On 08/18/06 14:08:
> Frank,
>
> The SC 3.2 beta may be closed, but I'm forwarding your request to Eric
> Redmond.

The Sun Cluster 3.2 Beta program has been extended. You can apply for the Beta via this URL:
https://feedbackprograms.sun.com/callout/default.html?callid={11B4E37C-D608-433B-AF69-07F6CD714AA1}

------------------------------------------------------------------------

Sun Cluster 3.2: New Features

*Ease of Use*

_*New Sun Cluster Object Oriented Command Set*_
The new SC command line interface includes one command per cluster object type and consistent use of sub-command names and option letters. It also supports short and long command names. The command output has been greatly improved, with better help and error messages as well as more readable status and configuration reporting. In addition, some commands include export and import options using portable XML-based configuration files, allowing replication of part of, or entire, cluster configurations. This new interface is easier to learn and easier to use, thereby limiting human error during cluster administration. It also speeds up partial or full configuration cloning.

_*Oracle RAC 10g improved integration and administration*_
Sun Cluster RAC package installation and configuration are now integrated into the Sun Cluster procedures. New RAC-specific resource types and properties can be used for finer-grained control. Oracle RAC extended manageability leads to easier set-up of Oracle RAC within Sun Cluster as well as improved diagnosability and availability.

_*Agent configuration wizards*_
A new GUI-based wizard provides simplified configuration for popular applications via on-line help, automatic discovery of parameter choices, and immediate validation. Supported applications include Oracle RAC and HA, NFS, Apache, and SAP. Agent configuration is easier and less error-prone, enabling faster set-up of popular solutions.

_*Flexible IP address scheme*_
Sun Cluster now allows a reduced range of IP addresses for its private interconnect. In addition, it is now possible to customize the IP base address and its range during or after installation. These changes facilitate integration of Sun Cluster environments into existing networks with limited or regulated address spaces.

*Availability*

_*Cluster support for SMF services*_
Sun Cluster now integrates tightly with the Solaris 10 Service Management Facility (SMF) and enables the encapsulation of SMF-controlled applications in the Sun Cluster resource management model. Local service-level life-cycle management continues to be handled by SMF, while cluster-wide failure handling at the resource level (node, storage, ...) is carried out by Sun Cluster. Moving applications from a single-node Solaris 10 environment to a multi-node Sun Cluster environment increases availability while requiring limited to no effort.

_*Extended flexibility for fencing protocol*_
This new functionality allows customization of the default fencing protocol: choices include SCSI-3, SCSI-2, or per-device discovery. This flexibility enables the default use of SCSI-3, a more recent protocol, for better multipathing support, easier integration with non-Sun storage, and shorter recovery times on newer storage, while still supporting the SC 3.0/3.1 behavior and SCSI-2 for older devices.
_*Quorum Server*_
A new quorum device option is now available in Sun Cluster: instead of using a shared disk and SCSI reservation protocols, it is now possible to use a Solaris server outside of the cluster to run a quorum server module supporting an atomic reservation protocol over TCP/IP. This enables faster failover times and also lowers deployment costs: it removes the need for a shared quorum disk in any scenario where quorum is required (2-node) or desired.

_*Disk path failure handling*_
Sun Cluster can now be configured to automatically reboot a node if all of its paths to shared disks have failed. Faster reaction to severe disk path failure improves availability.

_*HAStoragePlus availability improvements*_
HAStoragePlus mount points are now created automatically in case of mount failure, eliminating failure-to-failover cases and thus improving availability of the environment.

*Flexibility*

_*Solaris Container expanded support*_
Any application of scalable or failover type and its associated Sun Cluster agent can now run unmodified within Solaris Containers (except Oracle RAC). This combines the application containment offered by Solaris Containers with the increased availability provided by Sun Cluster.
Note: Currently only the following Sun Cluster agents are supported in Solaris Containers:
* JES Application Server
* JES Web Server
* JES MQ Server
* DNS
* Apache
* Kerberos
* HA-Oracle

_*HA ZFS*_
ZFS is supported as a failover file system in Sun Cluster. Together, ZFS and Sun Cluster offer a best-in-class file system solution combining high availability, data integrity, performance, and scalability, covering the needs of the most demanding environments.

_*HDS TrueCopy campus cluster*_
Sun Cluster based campus clusters now support HDS TrueCopy controller-based replication, allowing automated management of TrueCopy configurations. Sun Cluster automatically and transparently handles the switch to the secondary campus site in case of failover, making this procedure less error-prone and improving the overall availability of the solution. This new remote data replication infrastructure allows Sun Cluster to support new configurations for customers who have standardized on a specific replication infrastructure such as TrueCopy, and for places where host-based replication is not viable because of distance or application incompatibility. This combination brings improved availability and less complexity while lowering cost. Sun Cluster can make use of an existing TrueCopy replication infrastructure, limiting the need for an additional replication solution.

_*Multi-terabyte disk and EFI label support*_
Sun Cluster configurations can now include disks with capacities over 1TB thanks to support for the new EFI disk label. This format is required for multi-terabyte disks but can also be used with smaller-capacity disks. This extends supported Sun Cluster configurations to environments with high-end storage requirements.

_*Extended support for Veritas software components*_
Veritas Volume Manager and File System, part of Veritas Storage Foundation 5.0, are now supported on SPARC platforms, as is Veritas Volume Manager 4.1 with Solaris 10 OS on x86/x64 platforms. Veritas Volume Replicator (VVR) and Veritas Fast Mirror Resynchronization (FMR), part of Veritas FlashSnap, can now be used in Sun Cluster environments on SPARC platforms.
By adding support for Veritas replication and synchronization technology, the x86/x64 version, and the latest release of the Veritas software, Sun Cluster provides more choice for customers and allows them to use Sun Cluster in environments where third-party storage management solutions such as Veritas Storage Foundation are the standard.

_*Quota support*_
Quota management can now be used with HAStoragePlus on local UFS file systems for better control of resource consumption.

_*Oracle Data Guard support*_
Customers are now able to operate Oracle Data Guard data replication configurations under Sun Cluster control. Sun Cluster now offers improved usability for Oracle deployments that include Data Guard data replication software.

*OAMP*

_*Dual-partition software swap*_
With this new software swap feature the upgrade process is greatly simplified: any components of the software stack, along with Sun Cluster, can be upgraded in one step - Solaris, Sun Cluster, file systems and volume managers, applications. This automation lowers the risk of human error during a cluster upgrade, a very complex procedure, and minimizes the service outage incurred by a classical cluster upgrade.

_*Live Upgrade*_
The Live Upgrade procedure can now be used with Sun Cluster. This procedure reduces node downtime during upgrade as well as unnecessary reboots, thereby lowering the required maintenance window during which the service is at risk.

_*Optional GUI installation*_
Sun Cluster Manager, the Sun Cluster management GUI, can be left out during installation. This removes web-based access to the cluster to comply with potential security rules.

_*SNMP event MIB*_
Sun Cluster includes a new SNMP event mechanism as well as a new SNMP MIB. They allow third-party SNMP management applications to register directly with Sun Cluster and receive timely notifications of cluster events. Fine-grained event notification and direct integration with third-party enterprise management frameworks through standard SNMP support allow proactive monitoring and increase availability.

_*Command logging*_
Commands can now be logged within Sun Cluster. This facilitates diagnosis of cluster failures and provides a history of administration actions for archiving or replication.

_*Workload system resource monitoring*_
Sun Cluster offers new system resource utilization measurement and visualization tools, including fine-grained measurement of consumption per node, resource, and resource group. These new tools provide historical data as well as threshold management and CPU reservation and control. This improved control allows for better management of service levels and capacity.

*Performance*

Several performance improvements have been introduced in this latest Sun Cluster release.
* Sun Cluster Manager, previously known as SunPlex Manager, has undergone several performance improvements, in particular when navigating to the different screens. Some operations have been sped up as much as four times.
* PxFS performance improvements on the order of 5-6 times are possible, depending on the workload, using the fastwrite option.
* Switchover times for HAStoragePlus are improved up to five times thanks to parallelized mounting of the file systems under HAStoragePlus control.
Eric Redmond
Solaris Enterprise System, Beta Program Manager
Sun Microsystems, Inc.
17 Network Circle
Menlo Park, CA 94025
Phone: x85550 / +1 650 786 5550
Fax: 650-786-5734
Email: Eric.Redmond at Sun.COM
On August 19, 2006 10:53:55 AM -0700 Eric Redmond <Eric.Redmond at Sun.COM> wrote:
> Sun Cluster 3.2: New Features

wow, this makes 3.1 sound like dog food.

-frank
> PxFS performance improvements on the order of 5-6 times are possible,
> depending on the workload, using the fastwrite option.

Fantastic! Has this been targeted at directory operations? We've had issues with large directories full of small files being very slow to handle over PxFS.

Are there plans for PxFS on ZFS any time soon :) ? Or any plans to release PxFS as part of OpenSolaris?

Cheers,
Alan
Alan Romeril wrote:
> Fantastic! Has this been targeted at directory operations? We've
> had issues with large directories full of small files being very slow
> to handle over PxFS.

The 'fastwrite option' speeds up write operations, so this doesn't do much for directory operations.

> Are there plans for PxFS on ZFS any time soon :) ?

PxFS on ZFS is unlikely to happen. A clusterized version of ZFS (as mentioned before on this alias) is being considered.

> Or any plans to release PxFS as part of OpenSolaris?

PxFS is tightly coupled to the cluster framework. Without open-sourcing the cluster, PxFS in its current form cannot be open-sourced, as it would not make sense. As for open-sourcing the cluster, my guess is as good as yours.

Regards,
Manoj

--
Sun Cluster Engineering