Greetings, everyone. We are having issues with some Oracle databases on ZFS. We would appreciate any useful feedback you can provide.

We are using Oracle Financials, with all databases, control files, and logs on one big 2TB ZFS pool that is on a Hitachi SAN. (This is what the DBA group wanted.) For the most part, the system runs fine. When the DBAs do clones or backups using RMAN, however, all of the write activity almost freezes the system. The regular transactions back up, screen refreshes get delayed, etc. The system, quite simply, grinds almost to a halt.

The immediate issue seemed to be the log file writes. If we separate the log space, it just shifts the problem to the DB/control files. If we split out the backup space, things are OK during the RMAN backups, but the problems remain for the clones/restores. The issue seems to be serious write contention/performance. Some read issues also show up, but they seem to be secondary to the write issues.

We ran a test doing this for one environment on a test UFS LUN. While both the backup and restore operations took twice as long, we did not have the system lock-up issues we see with ZFS. We've also been playing a little with ztune.sh on a test system, but we really need to come to a proper solution (and a better understanding of what is causing the problems).

Can anyone explain to us why the writes are backing up like this, and what we can do about it? Under Solaris 8, VxFS worked just fine in this scenario. Due to the number of files, UFS was not an option. Since the environment is going to RAC in six months, upgrading Veritas did not seem like a justifiable option, with the (mistaken?) belief that ZFS performance would be more than adequate.

Thank you for any help you can provide.

Rainer
On Tue, 16 Jan 2007, Rainer Heilke wrote:

> Greetings, everyone.
>
> We are having issues with some Oracle databases on ZFS. We would appreciate any useful feedback you can provide.

You didn't give any details of the system (configuration) on which the DB runs... not even SPARC or x86/AMD64?

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
> We are having issues with some Oracle databases on
> ZFS. We would appreciate any useful feedback you can
> provide.
> [...]
> The issue seems to be serious write contention/performance.
> Some read issues also show up, but they seem to be
> secondary to the write issues.

What hardware is used? SPARC? x86 32-bit? x86 64-bit?
How much RAM is installed?
Which version of the OS?

Did you already try to monitor kernel memory usage while writing to ZFS? Maybe the kernel is running out of free memory? (I have bugs like 6483887 in mind, "without direct management, arc ghost lists can run amok".)

For a live system:

  echo ::kmastat | mdb -k
  echo ::memstat | mdb -k

In case you've got a crash dump for the hung system, you can try the same ::kmastat and ::memstat commands using the kernel crash dumps saved in the directory /var/crash/`hostname`:

  # cd /var/crash/`hostname`
  # mdb -k unix.1 vmcore.1
  ::memstat
  ::kmastat
> What hardware is used? SPARC? x86 32-bit? x86 64-bit?
> How much RAM is installed?
> Which version of the OS?

Sorry, this is happening on two systems (test and production). They're both Solaris 10, Update 2. Test is a V880 with 8 CPUs and 32GB; production is an E2900 with 12 dual-core CPUs and 48GB.

> Did you already try to monitor kernel memory usage while
> writing to ZFS? Maybe the kernel is running out of free
> memory? (I have bugs like 6483887 in mind, "without direct
> management, arc ghost lists can run amok".)

We haven't seen serious kernel memory usage that I know of (I'll be honest--I came into this problem late).

> For a live system:
>
>   echo ::kmastat | mdb -k
>   echo ::memstat | mdb -k

I can try this if the DBA group is willing to do another test, thanks.

> In case you've got a crash dump for the hung system, you can
> try the same ::kmastat and ::memstat commands using the kernel
> crash dumps saved in the directory /var/crash/`hostname`:
>
>   # cd /var/crash/`hostname`
>   # mdb -k unix.1 vmcore.1
>   ::memstat
>   ::kmastat

The system doesn't actually crash. It also doesn't freeze _completely_. While I call it a freeze (best name for it), it actually just slows down incredibly. It's like the whole system bogs down like molasses in January. Things happen, but very slowly.

Rainer
Rainer Heilke,

You have 1/4 of the memory that the E2900 is capable of (192GB, I think).

Secondly, output from fsstat(1M) could be helpful. Run this command over time and check to see whether the values change.

Mitchell Erblich
---------------

Rainer Heilke wrote:
> Sorry, this is happening on two systems (test and production). They're both Solaris 10, Update 2. Test is a V880 with 8 CPUs and 32GB; production is an E2900 with 12 dual-core CPUs and 48GB.
> [...]
> The system doesn't actually crash. It also doesn't freeze _completely_. While I call it a freeze (best name for it), it actually just slows down incredibly. It's like the whole system bogs down like molasses in January. Things happen, but very slowly.
> Rainer Heilke,
>
> You have 1/4 of the memory that the E2900 is capable of
> (192GB, I think).

Yep. The server does not hold the application (three-tier architecture), so this is the standard build we bought. The memory has not indicated any problems; all errors point to write issues.

> Secondly, output from fsstat(1M) could be helpful.
>
> Run this command over time and check to see whether the
> values change.

Thanks. I'll pass this along to the person doing the testing. He's been doing some measuring, but I'm not sure if fsstat was one of the tools.

Rainer
The DBA team doesn't want to do another test. They have "made up their minds". We have a meeting with them tomorrow, though, and will try to convince them of one more test so that we can try the mdb and fsstat tools. (The admin doing the tests was using iostat, not fsstat.) I, at least, am interested in finding exactly where the failure is, rather than just saying "ZFS doesn't work". :-(

Rainer
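For the proposed re-test, a minimal way to capture the filesystem-level and device-level views side by side might look like the following (the 5-second interval is arbitrary, not something specified in this thread):

  # fsstat zfs 5       # per-operation counts and read/write bandwidth for all ZFS mounts, every 5 seconds
  # iostat -xnz 5      # per-device service times and queueing, omitting idle devices

Comparing the two during an RMAN run would show whether the backlog is building at the filesystem layer or at the LUNs underneath it.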
Hello Rainer,

Tuesday, January 16, 2007, 5:02:01 PM, you wrote:

RH> scenario. Due to the number of files, UFS was not an option. Since
RH> the environment is going to RAC in six months, upgrading Veritas
RH> did not seem like a justifiable option, with the (mistaken?)
RH> belief that ZFS performance would be more than adequate.

What do you mean by UFS wasn't an option due to the number of files?

Also, do you have any tunables in the system? Can you send 'zpool status' output? (raidz, mirror, ...?)

"When the DBAs do clones" -- you mean that just by doing 'zfs clone ...' you get a big performance problem? Or maybe just before, when you do 'zfs snapshot' first? How much free space is left in the pool?

Do you have sar data from when the problems occurred? Any paging in the system?

And one piece of advice: before any more testing I would definitely upgrade/reinstall the system to U3 when it comes to ZFS.

-- 
Best regards,
Robert                mailto:rmilkowski at task.gda.pl
                      http://milek.blogspot.com
You're probably hitting the same wall/bug that I came across: ZFS in all versions up to and including Sol10U3 generates excessive I/O when it encounters fsync() or if any of the files were opened with the O_DSYNC option.

I do believe Oracle (or any DB for that matter) opens its files with the O_DSYNC option. During normal times this does result in excessive I/O, but it is probably well under your system capacity (it was in our case). But when you are doing backups or clones (Oracle clones by using RMAN or copying of DB files?) you are going to flood the I/O subsystem, and that's when the whole ZFS excessive-I/O behaviour starts to put a hurt on DB performance.

Here are a few suggestions that can give you interim relief (a sketch of the filesystem layout follows below):

- Segregate your I/O at the filesystem level; the bug is at the filesystem level, not the ZFS pool level. By this I mean ensure the online redo logs are in a ZFS filesystem that nobody else uses, and the same for control files. As long as the writes to the control files and online redo logs are met, your system will be happy.
- Ensure that your clone and RMAN (if you're going to disk) write to a separate ZFS filesystem that contains no production files.
- If the above two items don't give you relief, relocate the online redo logs and control files to a UFS filesystem. No need to downgrade the entire ZFS setup to something else.
- Consider Oracle ASM (DB version permitting); it works very well. Why deal with VxFS?

Feel free to drop me a line; I have over 17 years of Oracle DB experience and love to troubleshoot problems like this. I have another vested interest: we're considering ZFS for widespread use in our environment and any experience is good for us.
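As a rough illustration of the segregation idea (the pool and filesystem names are placeholders, not taken from this thread), the layout might be created along these lines:

  # zfs create dbpool/redo       # online redo logs only
  # zfs create dbpool/control    # control files only
  # zfs create dbpool/data       # datafiles
  # zfs create dbpool/rman       # RMAN/clone target, kept away from production files

Each Oracle file class then lives in its own ZFS filesystem, so an fsync() or O_DSYNC write against one filesystem no longer drags unrelated dirty data from the others through the intent log.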
Hello Anantha,

Wednesday, January 17, 2007, 2:35:01 PM, you wrote:

ANS> You're probably hitting the same wall/bug that I came across:
ANS> ZFS in all versions up to and including Sol10U3 generates
ANS> excessive I/O when it encounters fsync() or if any of the files
ANS> were opened with the O_DSYNC option.
ANS> [...]

Also, as a workaround you could disable the ZIL, if that is acceptable to you (in case of a system panic or hard reset you can end up with an unrecoverable database).

-- 
Best regards,
Robert                mailto:rmilkowski at task.gda.pl
                      http://milek.blogspot.com
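For reference, on ZFS builds of that era the workaround Robert mentions was a kernel tunable; a hedged sketch of the /etc/system entry (followed by a reboot), with the obvious caveat that synchronous-write guarantees are lost:

  * Disable the ZFS intent log; writes acknowledged to the application are no
  * longer guaranteed to be on stable storage after a panic or power loss.
  set zfs:zil_disable = 1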
Anantha N. Srirama wrote:
> You're probably hitting the same wall/bug that I came across: ZFS in all versions up to and including Sol10U3 generates excessive I/O when it encounters fsync() or if any of the files were opened with the O_DSYNC option.

Why is excessive I/O generated? fsync or O_DSYNC may cause additional write cache-flushes issued by the ZIL, but the total amount of I/O (not including cache flushes) should remain the same, right?
Anantha N. Srirama
2007-Jan-17 15:32 UTC
[zfs-discuss] Re: Re: Heavy writes freezing system
Bug 6413510 is the root cause. ZFS maestros, please correct me if I'm quoting an incorrect bug.
> What do you mean by UFS wasn't an option due to the
> number of files?

Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle Financials environment well exceeds this limitation.

> Also, do you have any tunables in the system?
> Can you send 'zpool status' output? (raidz, mirror, ...?)

Our tunables are:

  set noexec_user_stack=1
  set sd:sd_max_throttle = 32
  set sd:sd_io_time = 0x3c

zpool status:

  > zpool status
    pool: d
   state: ONLINE
   scrub: none requested
  config:

        NAME                                     STATE     READ WRITE CKSUM
        d                                        ONLINE       0     0     0
          c5t60060E800475AA00000075AA0000100Bd0  ONLINE       0     0     0
          c5t60060E800475AA00000075AA0000100Dd0  ONLINE       0     0     0
          c5t60060E800475AA00000075AA0000100Cd0  ONLINE       0     0     0
          c5t60060E800475AA00000075AA0000100Ed0  ONLINE       0     0     0

  errors: No known data errors

> "When the DBAs do clones" -- you mean that just by doing 'zfs clone ...'
> you get a big performance problem? Or maybe just before, when you do
> 'zfs snapshot' first? How much free space is left in the pool?

Nope. The DBA group clones the production instance using OEM in order to build copies for Education, development, etc. This is strictly an Oracle function, not a file system (ZFS) operation.

> Do you have sar data from when the problems occurred? Any paging in the system?

Some. I'll have to have the other analyst try to pull out the times when our testing was done, but I've been told nothing stood out. (I love playing middle-man. NOT!)

> And one piece of advice: before any more testing I would definitely
> upgrade/reinstall the system to U3 when it comes to ZFS.

Not an option. This isn't even a faint possibility. We're talking both our test/development servers and our production/education servers. That's six servers to upgrade (remember, we have the applications on servers distinct from the database servers--the DBAs would never let us diverge the OS releases).

Rainer
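(Robert's free-space and pool-activity questions could be answered from the pool itself; a quick sketch, using the pool name from the zpool status output above and an arbitrary interval:)

  # zpool list d           # capacity, used and available space for pool "d"
  # zpool iostat -v d 5    # per-vdev bandwidth and IOPS, sampled every 5 seconds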
Thanks for the feedback! This does sound like what we're hitting. From our testing, you are absolutely correct--separating out the parts is a major help.

The big problem we still see, though, is doing the clones/recoveries. The DBA group clones the production environment for Education. Since both of these instances live on the same server and ZFS pool/filesystem, this kills the throughput. When doing cloning or backups to a different area, whether UFS or ZFS, we don't have the issues.

I'll know for sure later today or tomorrow, but it sounds like they are seriously considering the ASM route. Since we will be going to RAC later this year, this move makes the most sense. We'll just have to hope that the DBA group gets a better understanding of LUNs and our SAN, as they'll be taking over part of the disk (LUN) management. :-/ We were hoping we could get some interim relief on the ZFS front through tuning or something, but if what you're saying is correct (and it sounds like it is), we may be out of luck.

Thanks very much for the feedback.

Rainer
> Also, as a workaround you could disable the ZIL, if that is
> acceptable to you (in case of a system panic or hard reset you
> can end up with an unrecoverable database).

Again, not an option, but thanks for the pointer. I read a bit about this last week, and it sounds way too scary.

Rainer
Rainer Heilke wrote:
> I'll know for sure later today or tomorrow, but it sounds like they are
> seriously considering the ASM route. Since we will be going to RAC later
> this year, this move makes the most sense. We'll just have to hope that
> the DBA group gets a better understanding of LUNs and our SAN, as they'll
> be taking over part of the disk (LUN) management. :-/ We were hoping we
> could get some interim relief on the ZFS front through tuning or something,
> but if what you're saying is correct (and it sounds like it is), we may be
> out of luck.

If you plan on RAC, then ASM makes good sense. It is unclear (to me anyway) whether ASM over a zvol is better than ASM over a raw LUN. It would be nice to have some of the ZFS features such as snapshots, without having to go through extraordinary pain or buy expensive RAID arrays. If someone has tried ASM on a zvol, please speak up :-)
 -- richard
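For anyone wanting to try that combination, a minimal sketch (pool name, volume name, and size are hypothetical, not from this thread) would be to carve a zvol out of an existing pool and hand its raw device to ASM as a candidate disk:

  # zfs create -V 100g dbpool/asmvol1
  # ls -l /dev/zvol/rdsk/dbpool/asmvol1    # raw device node that ASM would be pointed at

ASM would then discover /dev/zvol/rdsk/dbpool/asmvol1 through its asm_diskstring setting, much as it would a raw SAN LUN (device ownership for the oracle user still has to be arranged, as with any raw device).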
Anantha N. Srirama wrote On 01/17/07 08:32:
> Bug 6413510 is the root cause. ZFS maestros, please correct me if I'm quoting an incorrect bug.

Yes, Anantha is correct; that is the bug id which could be responsible for more disk writes than expected. Let me try to explain that bug.

The ZIL, as described in http://blogs.sun.com/perrin, collects transactions in memory for all system calls until they are committed in a transaction group (txg) at the pool level. If a request arrives to force a particular file to stable storage (fsync or O_DSYNC), then the ZIL used to write out all in-memory transactions for the file system. This meant transactions unrelated to that file were written, including directory creations, renames, etc.--which might be important in being able to re-create the file. However, it also pushed out user data for other files, which can be voluminous.

The problem was originally seen when a ksh history file was fsync-ed during a large data write. It would take many seconds to flush the large write through the log, just to ensure a "pwd" command typed was safely on disk! This inefficiency occurs only when a "mismatch" of applications use the same file system.

The fix was essentially to push out all metadata for the file system, but only the file data related to the file being fsync-ed or O_DSYNC-ed. This problem was fixed in snv_48 last September and will be in S10_U4.

Neil.
Rainer Heilke wrote:
>> What do you mean by UFS wasn't an option due to the
>> number of files?
>
> Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle
> Financials environment well exceeds this limitation.

Really?!? I thought Oracle would use a database for storage...

>> Also, do you have any tunables in the system?
>> Can you send 'zpool status' output? (raidz, mirror, ...?)
>
> Our tunables are:
>
>   set noexec_user_stack=1
>   set sd:sd_max_throttle = 32
>   set sd:sd_io_time = 0x3c

EMC?

> [...]
>
>> And one piece of advice: before any more testing I would definitely
>> upgrade/reinstall the system to U3 when it comes to ZFS.
>
> Not an option. This isn't even a faint possibility. We're talking both our test/development servers and our production/education servers. That's six servers to upgrade (remember, we have the applications on servers distinct from the database servers--the DBAs would never let us diverge the OS releases).

Yes, this is common, so you should look for the patches which should fix at least the fsync problem. Check the archives here for patch update info from George Wilson.
 -- richard
>> What do you mean by UFS wasn't an option due to the
>> number of files?
>
> Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle
> Financials environment well exceeds this limitation.

What?

  $ uname -a
  SunOS core 5.10 Generic_118833-17 sun4u sparc SUNW,UltraSPARC-IIi-cEngine
  $ df -F ufs -t
  /            (/dev/md/dsk/d0 ):   5367776 blocks    616328 files
                         total:    13145340 blocks    792064 files
  /export/nfs  (/dev/md/dsk/d8 ):  83981368 blocks  96621651 files
                         total:   404209452 blocks 100534720 files
  /export/home (/dev/md/dsk/d7 ):    980894 blocks    260691 files
                         total:      986496 blocks    260736 files
  $

I think that I am 95,621,651 files over your 1 million limit right there! Should I place a support call and file a bug report?

Dennis
Dennis Clarke wrote:
> [...]
>   /export/nfs  (/dev/md/dsk/d8 ):  83981368 blocks  96621651 files
>                          total:   404209452 blocks 100534720 files
> [...]
> I think that I am 95,621,651 files over your 1 million limit right there!

Is that a multi-terabyte UFS? If no, ignore :-); if yes, the actual limit is "1 million inodes PER terabyte".

HTH
-- 
Michael Schuster
Sun Microsystems, Inc.
We had a 2TB filesystem. No matter what options I set explicitly, the UFS filesystem kept getting written with a 1 million file limit. Believe me, I tried a lot of options, and they kept getting set back on me. After a fair bit of poking around (Google, Sun's site, etc.) I found several other notes indicating that this was the limit for UFS file systems. (For the pedants, keep in mind we are talking computers, so the actual number will be some exponent of 2; "1 million" is an approximation.)

If someone has gotten around this under UFS, I'd be very interested--as an intellectual curiosity--in knowing what switches you passed to the mkfs/newfs command(s).

Rainer
Casper.Dik at Sun.COM
2007-Jan-17 21:14 UTC
[zfs-discuss] Re: Re: Heavy writes freezing system
>We had a 2TB filesystem. No matter what options I set explicitly, the
>UFS filesystem kept getting written with a 1 million file limit.
>Believe me, I tried a lot of options, and they kept getting set back
>on me.

The limit is documented as "1 million inodes per TB", so something must not have gone right. But many people have complained, and you could take the newfs source and fix the limitation.

The discontinuity when going from <1TB to over 1TB is appalling. (<1TB allows for 137 million inodes; >= 1TB allows for 1 million per TB.) The rationale is fsck time (but logging is forced anyway).

The 1 million limit is arbitrary and too low...

Casper
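Concretely, the knob in question is newfs's bytes-per-inode (nbpi) value; a sketch of what Rainer was most likely fighting (the device name is a placeholder):

  # newfs -T -i 8192 /dev/rdsk/c5tXXXXd0s0

With -T (multi-terabyte parameters), newfs quietly raises nbpi to roughly 1MB per inode regardless of the -i value given, which is where the ceiling of about 1 million files per TB that Casper describes comes from. Running 'df -F ufs -t' or 'fstyp -v' on the device afterwards shows the inode count that actually took effect.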
Hi Anantha,

I was curious why segregating at the FS level would provide adequate I/O isolation? Since all the filesystems are on the same pool, I assumed flogging one FS would flog the pool and negatively affect all the other filesystems on that pool?

Best Regards,
Jason

On 1/17/07, Anantha N. Srirama <anantha.srirama at cdc.hhs.gov> wrote:
> You're probably hitting the same wall/bug that I came across: ZFS in all versions up to and including Sol10U3 generates excessive I/O when it encounters fsync() or if any of the files were opened with the O_DSYNC option.
> [...]
> - Segregate your I/O at the filesystem level; the bug is at the filesystem level, not the ZFS pool level. By this I mean ensure the online redo logs are in a ZFS filesystem that nobody else uses, and the same for control files. As long as the writes to the control files and online redo logs are met, your system will be happy.
> [...]
It turns out we're probably going to go the UFS/ZFS route, with 4 filesystems (the DB files on UFS with directio). It seems that the pain of moving from a single-node ASM to a RAC'd ASM is great, and not worth it. The DBA group decided that doing the migration to UFS for the DB files now, and then to a RAC'd ASM later, will end up being the easiest, safest route.

Rainer
Still curious as to if and when this bug will get fixed...
Bag-o-tricks-r-us, I suggest the following in such a case (a command-level sketch follows below):

- Two ZFS pools:
  - One for Production
  - One for Education
- Isolate the LUNs feeding the pools if possible; don't share spindles. Remember, on EMC/Hitachi you have logical LUNs created by striping/concatenating carved-up physical disks, so you could have two LUNs that share the same spindle. Don't believe one word from your storage admin about "we've got lots of cache to abstract the physical structure"; Oracle can push any storage subsystem over the edge. Almost all of the storage vendors prevent one LUN from flooding the cache with writes; EMC gives no more than 8x the initial allocation of cache (total cache/total disk space), and after that it'll stall your writes until destage is complete.
- At least two ZFS filesystems under the Production pool:
  - One for online redo logs and control files. If need be, you can further segregate them onto two separate ZFS filesystems.
  - One for DB files. If need be, you can isolate further by data, index, temp, archived redo, ...
- Don't host 'temp' on ZFS; just feed it plain old UFS or raw disk.
- Match up your ZFS recordsize with your DB blocksize * multi-block read count. Don't do this for the index filesystem, just the filesystem hosting data.

Rinse and repeat for your Education ZFS pool. This will give you substantial isolation and improvement, sufficient to buy you time to plan out a better deployment strategy, given that you're under the gun now.

Another thought: while ZFS works out its kinks, why not use BCV or ShadowCopy or whatever IBM calls it to create the Education instance? This will reduce a tremendous amount of I/O.

Just this past weekend I re-did our SAS server to relocate _just_ the SAS work area to good ol' UFS, and the payback is tremendous; not one complaint about performance 3 days in a row (we used to hear daily complaints). By taking care of your online redo logs and control files (maybe skipping ZFS for them altogether and running them on UFS) you'll breathe easier.

BTW, I'm curious what application using Oracle is creating more than a million files?
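A sketch of the two-pool layout described above (LUN device names, pool names, and the 8K recordsize are illustrative assumptions, not values from this thread):

  # zpool create prodpool c5tAAAAd0 c5tBBBBd0    # production LUNs only
  # zpool create edupool  c5tCCCCd0 c5tDDDDd0    # Education/clone LUNs on separate spindles
  # zfs create prodpool/redo                     # online redo logs and control files
  # zfs create prodpool/data                     # datafiles
  # zfs set recordsize=8k prodpool/data          # recordsize matched to the DB I/O size (8K is just an example)
  # zfs create prodpool/rman                     # RMAN/backup target, away from production files

The same pattern would then be repeated under edupool for the cloned Education instance.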
Anantha N. Srirama
2007-Jan-17 22:53 UTC
[zfs-discuss] Re: Re: Heavy writes freezing system
I did some straight-up Oracle/ZFS testing, but not on zvols. I'll give it a shot and report back; next week is the earliest.
Rainer Heilke wrote On 01/17/07 15:44:
> It turns out we're probably going to go the UFS/ZFS route, with 4 filesystems (the DB files on UFS with directio).
>
> It seems that the pain of moving from a single-node ASM to a RAC'd ASM is great, and not worth it.
> The DBA group decided that doing the migration to UFS for the DB files now, and
> then to a RAC'd ASM later, will end up being the easiest, safest route.
>
> Rainer
> Still curious as to if and when this bug will get fixed...

If you're referring to bug 6413510 that Anantha mentioned, then my earlier post today answered that:

> This problem was fixed in snv_48 last September and will be
> in S10_U4.

Neil
Rainer Heilke
2007-Jan-17 23:10 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system
> The limit is documented as "1 million inodes per TB",
> so something must not have gone right. But many people have
> complained, and you could take the newfs source and fix the
> limitation.

"Patching" the source ourselves would not fly very far, but thanks for the clarification. I guess I have to assume, then, that somewhere around this million mark we also ran out of inodes. With the wide range in file sizes for the files, this doesn't surprise me. There was no way to tune the file system for anything.

> The discontinuity when going from <1TB to over 1TB is appalling.
> (<1TB allows for 137 million inodes; >= 1TB allows for 1 million per TB.)

Either way, we were stuck. Our test/devl environment goes way beyond 1 million files (read: inodes). I think we hit the ceiling half-way into our data copy, if memory serves. I think the argument I saw for this inode disparity was that a >1TB FS "was only for database files" and not the binaries, or something to that effect.

> The rationale is fsck time (but logging is forced anyway).

I can't remember for sure, but this might have been mentioned in one of the notes I found.

> The 1 million limit is arbitrary and too low...
>
> Casper

Thank you very much for the clarification, and for the candor. It is greatly appreciated.

Rainer
Hello Jason,

Wednesday, January 17, 2007, 11:24:50 PM, you wrote:

JJWW> Hi Anantha,

JJWW> I was curious why segregating at the FS level would provide adequate
JJWW> I/O isolation? Since all the filesystems are on the same pool, I assumed
JJWW> flogging one FS would flog the pool and negatively affect all the other
JJWW> filesystems on that pool?

Because of the bug, which forces all outstanding writes in a file system to commit to storage in the case of one fsync on one file. When you separate data into different file systems, the bug will affect only data in that file system, which could greatly reduce the impact on performance if it's done right.

-- 
Best regards,
Robert                mailto:rmilkowski at task.gda.pl
                      http://milek.blogspot.com
Hi Robert,

I see. So it really doesn't get around the idea of putting DB files and logs on separate spindles?

Best Regards,
Jason

On 1/17/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Because of the bug, which forces all outstanding writes in a file
> system to commit to storage in the case of one fsync on one file.
> When you separate data into different file systems, the bug will
> affect only data in that file system, which could greatly reduce
> the impact on performance if it's done right.
Anton B. Rang
2007-Jan-18 03:31 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system
> Yes, Anantha is correct; that is the bug id which could be responsible
> for more disk writes than expected.

I believe, though, that this would explain at most a factor of 2 of "write expansion" (user data getting pushed to disk once in the intent log, then again in its final location). If the writes are relatively large, there'd be even less expansion, because the ZIL will write a large enough block of data (would this be 128K?) into a block which can be used as its final location. (If I'm understanding some earlier conversations right; I haven't looked at the code lately.)

Anton
Anton B. Rang wrote On 01/17/07 20:31:
>> Yes, Anantha is correct; that is the bug id which could be responsible
>> for more disk writes than expected.
>
> I believe, though, that this would explain at most a factor of 2
> of "write expansion" (user data getting pushed to disk once in the
> intent log, then again in its final location).

Agreed.

> If the writes are relatively large, there'd be even less expansion,
> because the ZIL will write a large enough block of data (would this
> be 128K?)

Anything over zfs_immediate_write_sz (currently 32KB) is written in this way.

> into a block which can be used as its final location. (If I'm
> understanding some earlier conversations right; I haven't looked at
> the code lately.)
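For reference, the live value of that threshold can be inspected with mdb (a read-only check; the variable name is from Neil's note, the format flag is my assumption):

  # echo 'zfs_immediate_write_sz/E' | mdb -k

On builds where it is tunable, it could also be set at boot from /etc/system with a line like 'set zfs:zfs_immediate_write_sz = 0x8000' (the value shown is simply the 32KB default, for illustration).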
If some aspect of the load is writing a large amount of data into the pool (through the memory cache, as opposed to the ZIL), and that leads to a frozen system, I think a possible contributor is:

  6429205  each zpool needs to monitor its throughput and throttle heavy writers

-r

Anantha N. Srirama writes:
> Bug 6413510 is the root cause. ZFS maestros, please correct me if I'm quoting an incorrect bug.
Jason J. W. Williams writes:
> Hi Anantha,
>
> I was curious why segregating at the FS level would provide adequate
> I/O isolation? Since all the filesystems are on the same pool, I assumed
> flogging one FS would flog the pool and negatively affect all the other
> filesystems on that pool?
>
> Best Regards,
> Jason

Good point. If the problem is

  6413510  zfs: writing to ZFS filesystem slows down fsync() on other files

then the segregation into 2 filesystems on the same pool will help. But if the problem is more like

  6429205  each zpool needs to monitor its throughput and throttle heavy writers

then 2 filesystems won't help. 2 pools probably would, though.

-r
> Bag-o-tricks-r-us, I suggest the following in such a case:
>
> - Two ZFS pools:
>   - One for Production
>   - One for Education

The DBAs are very resistant to splitting out whole environments. There are nine on the test/devl server! So, we're going to put the DB files and redo logs on separate (UFS with directio) LUNs. Binaries and backups will go onto two separate ZFS LUNs. With production, they can do their cloning at night to minimize impact. Not sure what they'll do on test/devl. The two ZFS file systems will probably also be separate zpools (for political reasons, as well as for juggling Hitachi disk space).

BTW, it wasn't the storage guys who decided on the "one filesystem to rule them all" strategy, but my predecessors. It was part of the move from Clarion arrays to Hitachi. The storage folks know about, understand, and agree with us when we talk about these kinds of issues (at least, they do now). We've pushed the caching and other subsystems often enough to make this painfully clear.

> Another thought: while ZFS works out its kinks, why not use
> BCV or ShadowCopy or whatever IBM calls it to create the
> Education instance? This will reduce a tremendous amount of I/O.

This means buying more software to alleviate a short-term problem (with RAC, the whole design will be different, including moving to ASM). We have RMAN and OEM already, so this argument won't fly.

> BTW, I'm curious what application using Oracle is creating
> more than a million files?

Oracle Financials. The application includes everything but the kitchen sink (but the bathroom sink is there!).

Thanks for all of your feedback and suggestions. They all sound bang on. If we could just get all the pieces in place to move forward now, I think we'll be OK. One big issue for us will be finding the Hitachi disk space--we're pretty full-up right now. :-(

Rainer
Rainer Heilke
2007-Jan-18 15:57 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system
> > This problem was fixed in snv_48 last September and will be
> > in S10_U4.

U4 doesn't help us any. We need the fix now. :-( By the time U4 is out, we may even be finished with (or certainly well into) our RAC/ASM migration, and this whole issue will be moot.

Rainer
Rainer Heilke
2007-Jan-18 16:00 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system
Thanks for the detailed explanation of the bug. This makes it clearer to us what's happening, and why (which is something I _always_ appreciate!). Unfortunately, U4 doesn't buy us anything for our current problem.

Rainer
> If you plan on RAC, then ASM makes good sense. It is unclear (to me anyway)
> whether ASM over a zvol is better than ASM over a raw LUN.

Hmm. I thought ASM was really the _only_ effective way to do RAC, but then, I'm not a DBA (and don't want to be ;-) We'll be just using raw LUNs. While the zvol idea is interesting, the DBAs are very particular about making sure the environment is set up in a way Oracle will support (and not hang up on when we have a problem).

Rainer
Rainer,

Have you considered looking for a patch? If you have the supported version(s) of Solaris (which it sounds like you do), this fix may already be available in a patch.

Bev.

Rainer Heilke wrote:
> Thanks for the detailed explanation of the bug. This makes it clearer to us what's happening, and why (which is something I _always_ appreciate!). Unfortunately, U4 doesn't buy us anything for our current problem.
>
> Rainer
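A quick way to check for and apply such a patch once its ID is known (the patch ID below is a placeholder, not the real one for this fix):

  # showrev -p | grep 123456              # is patch 123456 (placeholder) already installed?
  # patchadd /var/spool/patch/123456-01   # apply it from an unpacked patch directory

The actual patch ID would have to come from SunSolve or from the patch announcements George Wilson posted to this list.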
Rainer Heilke wrote:
>> If you plan on RAC, then ASM makes good sense. It is unclear (to me anyway)
>> whether ASM over a zvol is better than ASM over a raw LUN.
>
> Hmm. I thought ASM was really the _only_ effective way to do RAC,
> but then, I'm not a DBA (and don't want to be ;-) We'll be just
> using raw LUNs. While the zvol idea is interesting, the DBAs
> are very particular about making sure the environment is set up
> in a way Oracle will support (and not hang up on when we have a problem).

ASM is relatively new technology. Traditionally, OPS and RAC were built over raw devices, directly or as represented by cluster-aware logical volume managers. DBAs tend not to like raw, so Sun Cluster (Solaris Cluster) supports RAC over QFS, which is a very good solution. Some Sun Cluster customers run RAC over NFS, which also works surprisingly well. Meanwhile, Oracle continues to develop ASM to appease the DBAs who want filesystem-like solutions. IMHO, in the long run Oracle will transition many customers to ASM, and this means that it probably isn't worth the effort to make a file system be the best for Oracle at the expense of other features and workloads.
 -- richard
Rainer Heilke
2007-Jan-18 19:07 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system

Sorry, I should have qualified that "effective" better. I was specifically speaking in terms of Solaris and price. For companies without a SAN (especially those using Linux), something like a NetApp filer using NFS is the way to go, I realize. If you're running Solaris, the cost of QFS becomes a major factor. If you have a SAN, then getting a NetApp filer seems silly. And so on.

Oracle has suggested raw disk for some time, I think. (Some?) DBAs don't seem to like it, largely because they cannot see the files, and so on. ASM still has some of these limitations, but it's getting better, and DBAs are starting to get used to the new paradigms. If I remember a conversation last year correctly, OEM will become the window into some of these ideas.

Once ASM has industry acceptance on a large scale, then yes, making file systems perform well especially for Oracle databases will be chasing the wind. But that may be a while down the road. I don't know, my crystal ball got cracked during the last comet transition. ;-)

Rainer
Sorry, I should have qualified that "effective" better. I was specifically speaking in terms of Solaris and price. For companies without a SAN (especially using Linux), something like a NetApp Filer using UFS is the way to go, I realize. If you''re running Solaris, the cost of QFS becomes a major factor. If you have a SAN, then getting a NetApp Filer seems silly. And so on. Oracle has suggested RAW disk for some time, I think. (Some?) DBA''s don''t seem to like it largely because they cannot see the files, and so on. ASM still has some of these limitations, but it''s getting better, and DBA''s are starting to get used to the new paradigms. If I remember a conversation last year correctly, OEM will become the window into some of these ideas. Once ASM has industry acceptance on a large scale, then yes, making file systems perform well especially for Oracle databases will be chasing the wind. But, that may be a while down the road. I don''t know, my crystal ball got cracked during the last comet transition. ;-) Rainer This message posted from opensolaris.org