Greetings, everyone. We are having issues with some Oracle databases on ZFS. We would appreciate any useful feedback you can provide.

We are using Oracle Financials, with all databases, control files, and logs on one big 2TB ZFS pool that is on a Hitachi SAN. (This is what the DBA group wanted.) For the most part, the system runs fine. When the DBAs do clones or backups using RMAN, however, all of the write activity almost freezes the system. The regular transactions back up, screen refreshes get delayed, etc. The system, quite simply, grinds almost to a halt.

The immediate issue seemed to be the log file writes. If we separate the log space, it just shifts the problem to the DB/control files. If we split out the backup space, things are OK during the RMAN backups, but the problems remain for the clones/restores. The issue seems to be serious write contention/performance. Some read issues also show up, but they seem to be secondary to the write issues.

We ran a test doing this for one environment on a test UFS LUN. While both the backup and restore operations took twice as long, we did not have the system lock-up issues we see with ZFS. We've also been playing a little with ztune.sh on a test system, but we really need to come to a proper solution (and a better understanding of what is causing the problems).

Can anyone explain to us why the writes are backing up like this, and what we can do about it? Under Solaris 8, VxFS worked just fine in this scenario. Due to the number of files, UFS was not an option. Since the environment is going to RAC in six months, upgrading Veritas did not seem like a justifiable option, with the (mistaken?) belief that ZFS performance would be more than adequate.

Thank you for any help you can provide.

Rainer
On Tue, 16 Jan 2007, Rainer Heilke wrote:

> Greetings, everyone.
>
> We are having issues with some Oracle databases on ZFS. We would appreciate any useful feedback you can provide.

You didn't give any details of the system (configuration) on which the DB runs... not even SPARC or x86/AMD64?

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
> We are having issues with some Oracle databases on
> ZFS. We would appreciate any useful feedback you can
> provide.
> [...]
> The issue seems to be serious write contention/performance.
> Some read issues also show up, but they seem to be
> secondary to the write issues.

What hardware is used? SPARC? x86 32-bit? x86 64-bit?
How much RAM is installed?
Which version of the OS?

Did you already try to monitor kernel memory usage while writing to ZFS? Maybe the kernel is running out of free memory? (I have bugs like 6483887 in mind, "without direct management, arc ghost lists can run amok".)

For a live system:

  echo ::kmastat | mdb -k
  echo ::memstat | mdb -k

In case you've got a crash dump for the hung system, you can try the same ::kmastat and ::memstat commands using the kernel crash dumps saved in the directory /var/crash/`hostname`:

  # cd /var/crash/`hostname`
  # mdb -k unix.1 vmcore.1
  ::memstat
  ::kmastat
> What hardware is used? SPARC? x86 32-bit? x86 64-bit?
> How much RAM is installed?
> Which version of the OS?

Sorry, this is happening on two systems (test and production). They're both Solaris 10, Update 2. Test is a V880 with 8 CPUs and 32GB; production is an E2900 with 12 dual-core CPUs and 48GB.

> Did you already try to monitor kernel memory usage while
> writing to ZFS? Maybe the kernel is running out of free
> memory? (I have bugs like 6483887 in mind, "without direct
> management, arc ghost lists can run amok".)

We haven't seen serious kernel memory usage that I know of (I'll be honest--I came into this problem late).

> For a live system:
>
>   echo ::kmastat | mdb -k
>   echo ::memstat | mdb -k

I can try this if the DBA group is willing to do another test, thanks.

> In case you've got a crash dump for the hung system, you can
> try the same ::kmastat and ::memstat commands using the kernel
> crash dumps saved in the directory /var/crash/`hostname`:
>
>   # cd /var/crash/`hostname`
>   # mdb -k unix.1 vmcore.1
>   ::memstat
>   ::kmastat

The system doesn't actually crash. It also doesn't freeze _completely_. While I call it a freeze (best name for it), it actually just slows down incredibly. It's like the whole system bogs down like molasses in January. Things happen, but very slowly.

Rainer
Rainer Heilke,

You have 1/4 of the memory that the E2900 is capable of (192GB, I think).

Secondly, output from fsstat(1M) could be helpful. Run this command over time and check to see whether the values change.

Mitchell Erblich
---------------

Rainer Heilke wrote:
> Sorry, this is happening on two systems (test and production). They're both Solaris 10, Update 2. Test is a V880 with 8 CPUs and 32GB; production is an E2900 with 12 dual-core CPUs and 48GB.
> [...]
> The system doesn't actually crash. It also doesn't freeze _completely_. While I call it a freeze (best name for it), it actually just slows down incredibly. It's like the whole system bogs down like molasses in January. Things happen, but very slowly.
> Rainer Heilke,
>
> You have 1/4 of the memory that the E2900 is capable of
> (192GB, I think).

Yep. The server does not hold the application (three-tier architecture), so this is the standard build we bought. The memory has not indicated any problems; all errors point to write issues.

> Secondly, output from fsstat(1M) could be helpful.
>
> Run this command over time and check to see whether the
> values change.

Thanks. I'll pass this along to the person doing the testing. He's been doing some measuring, but I'm not sure if fsstat was one of the tools.

Rainer
The DBA team doesn't want to do another test. They have "made up their minds". We have a meeting with them tomorrow, though, and will try to convince them of one more test so that we can try the mdb and fsstat tools. (The admin doing the tests was using iostat, not fsstat.) I, at least, am interested in finding exactly where the failure is, rather than just saying "ZFS doesn't work". :-(

Rainer
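For the proposed re-test, a minimal way to capture the filesystem-level and device-level views side by side might look like the following (the 5-second interval is arbitrary, not something specified in this thread):

  # fsstat zfs 5       # per-operation counts and read/write bandwidth for all ZFS mounts, every 5 seconds
  # iostat -xnz 5      # per-device service times and queueing, omitting idle devices

Comparing the two during an RMAN run would show whether the backlog is building at the filesystem layer or at the LUNs underneath it.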
Hello Rainer,

Tuesday, January 16, 2007, 5:02:01 PM, you wrote:

RH> scenario. Due to the number of files, UFS was not an option. Since
RH> the environment is going to RAC in six months, upgrading Veritas
RH> did not seem like a justifiable option, with the (mistaken?)
RH> belief that ZFS performance would be more than adequate.

What do you mean by UFS wasn't an option due to the number of files?

Also, do you have any tunables in the system? Can you send 'zpool status' output? (raidz, mirror, ...?)

"When the DBAs do clones" -- you mean that just by doing 'zfs clone ...' you get a big performance problem? Or maybe just before, when you do 'zfs snapshot' first? How much free space is left in the pool?

Do you have sar data from when the problems occurred? Any paging in the system?

And one piece of advice: before any more testing I would definitely upgrade/reinstall the system to U3 when it comes to ZFS.

-- 
Best regards,
Robert                mailto:rmilkowski at task.gda.pl
                      http://milek.blogspot.com
You're probably hitting the same wall/bug that I came across: ZFS in all versions up to and including Sol10U3 generates excessive I/O when it encounters fsync() or if any of the files were opened with the O_DSYNC option.

I do believe Oracle (or any DB for that matter) opens its files with the O_DSYNC option. During normal times this does result in excessive I/O, but it is probably well under your system capacity (it was in our case). But when you are doing backups or clones (Oracle clones by using RMAN or copying of DB files?) you are going to flood the I/O subsystem, and that's when the whole ZFS excessive-I/O behaviour starts to put a hurt on DB performance.

Here are a few suggestions that can give you interim relief (a sketch of the filesystem layout follows below):

- Segregate your I/O at the filesystem level; the bug is at the filesystem level, not the ZFS pool level. By this I mean ensure the online redo logs are in a ZFS filesystem that nobody else uses, and the same for control files. As long as the writes to the control files and online redo logs are met, your system will be happy.
- Ensure that your clone and RMAN (if you're going to disk) write to a separate ZFS filesystem that contains no production files.
- If the above two items don't give you relief, relocate the online redo logs and control files to a UFS filesystem. No need to downgrade the entire ZFS setup to something else.
- Consider Oracle ASM (DB version permitting); it works very well. Why deal with VxFS?

Feel free to drop me a line; I have over 17 years of Oracle DB experience and love to troubleshoot problems like this. I have another vested interest: we're considering ZFS for widespread use in our environment and any experience is good for us.
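As a rough illustration of the segregation idea (the pool and filesystem names are placeholders, not taken from this thread), the layout might be created along these lines:

  # zfs create dbpool/redo       # online redo logs only
  # zfs create dbpool/control    # control files only
  # zfs create dbpool/data       # datafiles
  # zfs create dbpool/rman       # RMAN/clone target, kept away from production files

Each Oracle file class then lives in its own ZFS filesystem, so an fsync() or O_DSYNC write against one filesystem no longer drags unrelated dirty data from the others through the intent log.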
Hello Anantha,

Wednesday, January 17, 2007, 2:35:01 PM, you wrote:

ANS> You're probably hitting the same wall/bug that I came across:
ANS> ZFS in all versions up to and including Sol10U3 generates
ANS> excessive I/O when it encounters fsync() or if any of the files
ANS> were opened with the O_DSYNC option.
ANS> [...]

Also, as a workaround you could disable the ZIL, if that is acceptable to you (in case of a system panic or hard reset you can end up with an unrecoverable database).

-- 
Best regards,
Robert                mailto:rmilkowski at task.gda.pl
                      http://milek.blogspot.com
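For reference, on ZFS builds of that era the workaround Robert mentions was a kernel tunable; a hedged sketch of the /etc/system entry (followed by a reboot), with the obvious caveat that synchronous-write guarantees are lost:

  * Disable the ZFS intent log; writes acknowledged to the application are no
  * longer guaranteed to be on stable storage after a panic or power loss.
  set zfs:zil_disable = 1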
Anantha N. Srirama wrote:
> You're probably hitting the same wall/bug that I came across: ZFS in all versions up to and including Sol10U3 generates excessive I/O when it encounters fsync() or if any of the files were opened with the O_DSYNC option.

Why is excessive I/O generated? fsync or O_DSYNC may cause additional write cache-flushes issued by the ZIL, but the total amount of I/O (not including cache flushes) should remain the same, right?
Anantha N. Srirama
2007-Jan-17 15:32 UTC
[zfs-discuss] Re: Re: Heavy writes freezing system
Bug 6413510 is the root cause. ZFS maestros, please correct me if I'm quoting an incorrect bug.
> What do you mean by UFS wasn't an option due to the
> number of files?

Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle Financials environment well exceeds this limitation.

> Also, do you have any tunables in the system?
> Can you send 'zpool status' output? (raidz, mirror, ...?)

Our tunables are:

  set noexec_user_stack=1
  set sd:sd_max_throttle = 32
  set sd:sd_io_time = 0x3c

zpool status:

  > zpool status
    pool: d
   state: ONLINE
   scrub: none requested
  config:

        NAME                                     STATE     READ WRITE CKSUM
        d                                        ONLINE       0     0     0
          c5t60060E800475AA00000075AA0000100Bd0  ONLINE       0     0     0
          c5t60060E800475AA00000075AA0000100Dd0  ONLINE       0     0     0
          c5t60060E800475AA00000075AA0000100Cd0  ONLINE       0     0     0
          c5t60060E800475AA00000075AA0000100Ed0  ONLINE       0     0     0

  errors: No known data errors

> "When the DBAs do clones" -- you mean that just by doing 'zfs clone ...'
> you get a big performance problem? Or maybe just before, when you do
> 'zfs snapshot' first? How much free space is left in the pool?

Nope. The DBA group clones the production instance using OEM in order to build copies for Education, development, etc. This is strictly an Oracle function, not a file system (ZFS) operation.

> Do you have sar data from when the problems occurred? Any paging in the system?

Some. I'll have to have the other analyst try to pull out the times when our testing was done, but I've been told nothing stood out. (I love playing middle-man. NOT!)

> And one piece of advice: before any more testing I would definitely
> upgrade/reinstall the system to U3 when it comes to ZFS.

Not an option. This isn't even a faint possibility. We're talking both our test/development servers and our production/education servers. That's six servers to upgrade (remember, we have the applications on servers distinct from the database servers--the DBAs would never let us diverge the OS releases).

Rainer
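(Robert's free-space and pool-activity questions could be answered from the pool itself; a quick sketch, using the pool name from the zpool status output above and an arbitrary interval:)

  # zpool list d           # capacity, used and available space for pool "d"
  # zpool iostat -v d 5    # per-vdev bandwidth and IOPS, sampled every 5 seconds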
Thanks for the feedback! This does sound like what we're hitting. From our testing, you are absolutely correct--separating out the parts is a major help.

The big problem we still see, though, is doing the clones/recoveries. The DBA group clones the production environment for Education. Since both of these instances live on the same server and ZFS pool/filesystem, this kills the throughput. When doing cloning or backups to a different area, whether UFS or ZFS, we don't have the issues.

I'll know for sure later today or tomorrow, but it sounds like they are seriously considering the ASM route. Since we will be going to RAC later this year, this move makes the most sense. We'll just have to hope that the DBA group gets a better understanding of LUNs and our SAN, as they'll be taking over part of the disk (LUN) management. :-/ We were hoping we could get some interim relief on the ZFS front through tuning or something, but if what you're saying is correct (and it sounds like it is), we may be out of luck.

Thanks very much for the feedback.

Rainer
> Also, as a workaround you could disable the ZIL, if that is
> acceptable to you (in case of a system panic or hard reset you
> can end up with an unrecoverable database).

Again, not an option, but thanks for the pointer. I read a bit about this last week, and it sounds way too scary.

Rainer
Rainer Heilke wrote:
> I'll know for sure later today or tomorrow, but it sounds like they are
> seriously considering the ASM route. Since we will be going to RAC later
> this year, this move makes the most sense. We'll just have to hope that
> the DBA group gets a better understanding of LUNs and our SAN, as they'll
> be taking over part of the disk (LUN) management. :-/ We were hoping we
> could get some interim relief on the ZFS front through tuning or something,
> but if what you're saying is correct (and it sounds like it is), we may be
> out of luck.

If you plan on RAC, then ASM makes good sense. It is unclear (to me anyway) whether ASM over a zvol is better than ASM over a raw LUN. It would be nice to have some of the ZFS features such as snapshots, without having to go through extraordinary pain or buy expensive RAID arrays. If someone has tried ASM on a zvol, please speak up :-)
 -- richard
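For anyone wanting to try that combination, a minimal sketch (pool name, volume name, and size are hypothetical, not from this thread) would be to carve a zvol out of an existing pool and hand its raw device to ASM as a candidate disk:

  # zfs create -V 100g dbpool/asmvol1
  # ls -l /dev/zvol/rdsk/dbpool/asmvol1    # raw device node that ASM would be pointed at

ASM would then discover /dev/zvol/rdsk/dbpool/asmvol1 through its asm_diskstring setting, much as it would a raw SAN LUN (device ownership for the oracle user still has to be arranged, as with any raw device).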
Anantha N. Srirama wrote On 01/17/07 08:32:
> Bug 6413510 is the root cause. ZFS maestros, please correct me if I'm quoting an incorrect bug.

Yes, Anantha is correct; that is the bug id which could be responsible for more disk writes than expected. Let me try to explain that bug.

The ZIL, as described in http://blogs.sun.com/perrin, collects transactions in memory for all system calls until they are committed in a transaction group (txg) at the pool level. If a request arrives to force a particular file to stable storage (fsync or O_DSYNC), then the ZIL used to write out all in-memory transactions for the file system. This meant transactions unrelated to that file were written, including directory creations, renames, etc.--which might be important in being able to re-create the file. However, it also pushed out user data for other files, which can be voluminous.

The problem was originally seen when a ksh history file was fsync-ed during a large data write. It would take many seconds to flush the large write through the log, just to ensure a "pwd" command typed was safely on disk! This inefficiency occurs only when a "mismatch" of applications use the same file system.

The fix was essentially to push out all metadata for the file system, but only the file data related to the file being fsync-ed or O_DSYNC-ed. This problem was fixed in snv_48 last September and will be in S10_U4.

Neil.
Rainer Heilke wrote:
>> What do you mean by UFS wasn't an option due to the
>> number of files?
>
> Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle
> Financials environment well exceeds this limitation.

Really?!? I thought Oracle would use a database for storage...

>> Also, do you have any tunables in the system?
>> Can you send 'zpool status' output? (raidz, mirror, ...?)
>
> Our tunables are:
>
>   set noexec_user_stack=1
>   set sd:sd_max_throttle = 32
>   set sd:sd_io_time = 0x3c

EMC?

> [...]
>
>> And one piece of advice: before any more testing I would definitely
>> upgrade/reinstall the system to U3 when it comes to ZFS.
>
> Not an option. This isn't even a faint possibility. We're talking both our test/development servers and our production/education servers. That's six servers to upgrade (remember, we have the applications on servers distinct from the database servers--the DBAs would never let us diverge the OS releases).

Yes, this is common, so you should look for the patches which should fix at least the fsync problem. Check the archives here for patch update info from George Wilson.
 -- richard
>> What do you mean by UFS wasn't an option due to the
>> number of files?
>
> Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle
> Financials environment well exceeds this limitation.

What?

  $ uname -a
  SunOS core 5.10 Generic_118833-17 sun4u sparc SUNW,UltraSPARC-IIi-cEngine
  $ df -F ufs -t
  /            (/dev/md/dsk/d0 ):   5367776 blocks    616328 files
                         total:    13145340 blocks    792064 files
  /export/nfs  (/dev/md/dsk/d8 ):  83981368 blocks  96621651 files
                         total:   404209452 blocks 100534720 files
  /export/home (/dev/md/dsk/d7 ):    980894 blocks    260691 files
                         total:      986496 blocks    260736 files
  $

I think that I am 95,621,651 files over your 1 million limit right there! Should I place a support call and file a bug report?

Dennis
Dennis Clarke wrote:
> [...]
>   /export/nfs  (/dev/md/dsk/d8 ):  83981368 blocks  96621651 files
>                          total:   404209452 blocks 100534720 files
> [...]
> I think that I am 95,621,651 files over your 1 million limit right there!

Is that a multi-terabyte UFS? If no, ignore :-); if yes, the actual limit is "1 million inodes PER terabyte".

HTH
-- 
Michael Schuster
Sun Microsystems, Inc.
We had a 2TB filesystem. No matter what options I set explicitly, the UFS filesystem kept getting written with a 1 million file limit. Believe me, I tried a lot of options, and they kept getting set back on me. After a fair bit of poking around (Google, Sun's site, etc.) I found several other notes indicating that this was the limit for UFS file systems. (For the pedants, keep in mind we are talking computers, so the actual number will be some exponent of 2; "1 million" is an approximation.)

If someone has gotten around this under UFS, I'd be very interested--as an intellectual curiosity--in knowing what switches you passed to the mkfs/newfs command(s).

Rainer
Casper.Dik at Sun.COM
2007-Jan-17 21:14 UTC
[zfs-discuss] Re: Re: Heavy writes freezing system
>We had a 2TB filesystem. No matter what options I set explicitly, the
>UFS filesystem kept getting written with a 1 million file limit.
>Believe me, I tried a lot of options, and they kept getting set back
>on me.

The limit is documented as "1 million inodes per TB", so something must not have gone right. But many people have complained, and you could take the newfs source and fix the limitation.

The discontinuity when going from <1TB to over 1TB is appalling. (<1TB allows for 137 million inodes; >= 1TB allows for 1 million per TB.) The rationale is fsck time (but logging is forced anyway).

The 1 million limit is arbitrary and too low...

Casper
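Concretely, the knob in question is newfs's bytes-per-inode (nbpi) value; a sketch of what Rainer was most likely fighting (the device name is a placeholder):

  # newfs -T -i 8192 /dev/rdsk/c5tXXXXd0s0

With -T (multi-terabyte parameters), newfs quietly raises nbpi to roughly 1MB per inode regardless of the -i value given, which is where the ceiling of about 1 million files per TB that Casper describes comes from. Running 'df -F ufs -t' or 'fstyp -v' on the device afterwards shows the inode count that actually took effect.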
Hi Anantha,

I was curious why segregating at the FS level would provide adequate I/O isolation? Since all the filesystems are on the same pool, I assumed flogging one FS would flog the pool and negatively affect all the other filesystems on that pool?

Best Regards,
Jason

On 1/17/07, Anantha N. Srirama <anantha.srirama at cdc.hhs.gov> wrote:
> You're probably hitting the same wall/bug that I came across: ZFS in all versions up to and including Sol10U3 generates excessive I/O when it encounters fsync() or if any of the files were opened with the O_DSYNC option.
> [...]
> - Segregate your I/O at the filesystem level; the bug is at the filesystem level, not the ZFS pool level. By this I mean ensure the online redo logs are in a ZFS filesystem that nobody else uses, and the same for control files. As long as the writes to the control files and online redo logs are met, your system will be happy.
> [...]
It turns out we're probably going to go the UFS/ZFS route, with 4 filesystems (the DB files on UFS with directio). It seems that the pain of moving from a single-node ASM to a RAC'd ASM is great, and not worth it. The DBA group decided that doing the migration to UFS for the DB files now, and then to a RAC'd ASM later, will end up being the easiest, safest route.

Rainer
Still curious as to if and when this bug will get fixed...
Bag-o-tricks-r-us, I suggest the following in such a case (a command-level sketch follows below):

- Two ZFS pools:
  - One for Production
  - One for Education
- Isolate the LUNs feeding the pools if possible; don't share spindles. Remember, on EMC/Hitachi you have logical LUNs created by striping/concatenating carved-up physical disks, so you could have two LUNs that share the same spindle. Don't believe one word from your storage admin about "we've got lots of cache to abstract the physical structure"; Oracle can push any storage subsystem over the edge. Almost all of the storage vendors prevent one LUN from flooding the cache with writes; EMC gives no more than 8x the initial allocation of cache (total cache/total disk space), and after that it'll stall your writes until destage is complete.
- At least two ZFS filesystems under the Production pool:
  - One for online redo logs and control files. If need be, you can further segregate them onto two separate ZFS filesystems.
  - One for DB files. If need be, you can isolate further by data, index, temp, archived redo, ...
- Don't host 'temp' on ZFS; just feed it plain old UFS or raw disk.
- Match up your ZFS recordsize with your DB blocksize * multi-block read count. Don't do this for the index filesystem, just the filesystem hosting data.

Rinse and repeat for your Education ZFS pool. This will give you substantial isolation and improvement, sufficient to buy you time to plan out a better deployment strategy, given that you're under the gun now.

Another thought: while ZFS works out its kinks, why not use BCV or ShadowCopy or whatever IBM calls it to create the Education instance? This will reduce a tremendous amount of I/O.

Just this past weekend I re-did our SAS server to relocate _just_ the SAS work area to good ol' UFS, and the payback is tremendous; not one complaint about performance 3 days in a row (we used to hear daily complaints). By taking care of your online redo logs and control files (maybe skipping ZFS for them altogether and running them on UFS) you'll breathe easier.

BTW, I'm curious what application using Oracle is creating more than a million files?
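A sketch of the two-pool layout described above (LUN device names, pool names, and the 8K recordsize are illustrative assumptions, not values from this thread):

  # zpool create prodpool c5tAAAAd0 c5tBBBBd0    # production LUNs only
  # zpool create edupool  c5tCCCCd0 c5tDDDDd0    # Education/clone LUNs on separate spindles
  # zfs create prodpool/redo                     # online redo logs and control files
  # zfs create prodpool/data                     # datafiles
  # zfs set recordsize=8k prodpool/data          # recordsize matched to the DB I/O size (8K is just an example)
  # zfs create prodpool/rman                     # RMAN/backup target, away from production files

The same pattern would then be repeated under edupool for the cloned Education instance.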
Anantha N. Srirama
2007-Jan-17 22:53 UTC
[zfs-discuss] Re: Re: Heavy writes freezing system
I did some straight-up Oracle/ZFS testing, but not on zvols. I'll give it a shot and report back; next week is the earliest.
Rainer Heilke wrote On 01/17/07 15:44:
> It turns out we're probably going to go the UFS/ZFS route, with 4 filesystems (the DB files on UFS with directio).
>
> It seems that the pain of moving from a single-node ASM to a RAC'd ASM is great, and not worth it.
> The DBA group decided that doing the migration to UFS for the DB files now, and
> then to a RAC'd ASM later, will end up being the easiest, safest route.
>
> Rainer
> Still curious as to if and when this bug will get fixed...

If you're referring to bug 6413510 that Anantha mentioned, then my earlier post today answered that:

> This problem was fixed in snv_48 last September and will be
> in S10_U4.

Neil
Rainer Heilke
2007-Jan-17 23:10 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system
> The limit is documented as "1 million inodes per TB",
> so something must not have gone right. But many people have
> complained, and you could take the newfs source and fix the
> limitation.

"Patching" the source ourselves would not fly very far, but thanks for the clarification. I guess I have to assume, then, that somewhere around this million mark we also ran out of inodes. With the wide range in file sizes for the files, this doesn't surprise me. There was no way to tune the file system for anything.

> The discontinuity when going from <1TB to over 1TB is appalling.
> (<1TB allows for 137 million inodes; >= 1TB allows for 1 million per TB.)

Either way, we were stuck. Our test/devl environment goes way beyond 1 million files (read: inodes). I think we hit the ceiling half-way into our data copy, if memory serves. I think the argument I saw for this inode disparity was that a >1TB FS "was only for database files" and not the binaries, or something to that effect.

> The rationale is fsck time (but logging is forced anyway).

I can't remember for sure, but this might have been mentioned in one of the notes I found.

> The 1 million limit is arbitrary and too low...
>
> Casper

Thank you very much for the clarification, and for the candor. It is greatly appreciated.

Rainer
Hello Jason,

Wednesday, January 17, 2007, 11:24:50 PM, you wrote:

JJWW> Hi Anantha,

JJWW> I was curious why segregating at the FS level would provide adequate
JJWW> I/O isolation? Since all the filesystems are on the same pool, I assumed
JJWW> flogging one FS would flog the pool and negatively affect all the other
JJWW> filesystems on that pool?

Because of the bug, which forces all outstanding writes in a file system to commit to storage in the case of one fsync on one file. When you separate data into different file systems, the bug will affect only data in that file system, which could greatly reduce the impact on performance if it's done right.

-- 
Best regards,
Robert                mailto:rmilkowski at task.gda.pl
                      http://milek.blogspot.com
Hi Robert,

I see. So it really doesn't get around the idea of putting DB files and logs on separate spindles?

Best Regards,
Jason

On 1/17/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Because of the bug, which forces all outstanding writes in a file
> system to commit to storage in the case of one fsync on one file.
> When you separate data into different file systems, the bug will
> affect only data in that file system, which could greatly reduce
> the impact on performance if it's done right.
Anton B. Rang
2007-Jan-18 03:31 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system
> Yes, Anantha is correct; that is the bug id which could be responsible
> for more disk writes than expected.

I believe, though, that this would explain at most a factor of 2 of "write expansion" (user data getting pushed to disk once in the intent log, then again in its final location). If the writes are relatively large, there'd be even less expansion, because the ZIL will write a large enough block of data (would this be 128K?) into a block which can be used as its final location. (If I'm understanding some earlier conversations right; I haven't looked at the code lately.)

Anton
Anton B. Rang wrote On 01/17/07 20:31:
>> Yes, Anantha is correct; that is the bug id which could be responsible
>> for more disk writes than expected.
>
> I believe, though, that this would explain at most a factor of 2
> of "write expansion" (user data getting pushed to disk once in the
> intent log, then again in its final location).

Agreed.

> If the writes are relatively large, there'd be even less expansion,
> because the ZIL will write a large enough block of data (would this
> be 128K?)

Anything over zfs_immediate_write_sz (currently 32KB) is written in this way.

> into a block which can be used as its final location. (If I'm
> understanding some earlier conversations right; I haven't looked at
> the code lately.)
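For reference, the live value of that threshold can be inspected with mdb (a read-only check; the variable name is from Neil's note, the format flag is my assumption):

  # echo 'zfs_immediate_write_sz/E' | mdb -k

On builds where it is tunable, it could also be set at boot from /etc/system with a line like 'set zfs:zfs_immediate_write_sz = 0x8000' (the value shown is simply the 32KB default, for illustration).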
If some aspect of the load is writing a large amount of data into the pool (through the memory cache, as opposed to the ZIL), and that leads to a frozen system, I think a possible contributor is:

  6429205  each zpool needs to monitor its throughput and throttle heavy writers

-r

Anantha N. Srirama writes:
> Bug 6413510 is the root cause. ZFS maestros, please correct me if I'm quoting an incorrect bug.
Jason J. W. Williams writes:
> Hi Anantha,
>
> I was curious why segregating at the FS level would provide adequate
> I/O isolation? Since all the filesystems are on the same pool, I assumed
> flogging one FS would flog the pool and negatively affect all the other
> filesystems on that pool?
>
> Best Regards,
> Jason

Good point. If the problem is

  6413510  zfs: writing to ZFS filesystem slows down fsync() on other files

then the segregation into 2 filesystems on the same pool will help. But if the problem is more like

  6429205  each zpool needs to monitor its throughput and throttle heavy writers

then 2 filesystems won't help. 2 pools probably would, though.

-r
> Bag-o-tricks-r-us, I suggest the following in such a case:
>
> - Two ZFS pools:
>   - One for Production
>   - One for Education

The DBAs are very resistant to splitting out whole environments. There are nine on the test/devl server! So, we're going to put the DB files and redo logs on separate (UFS with directio) LUNs. Binaries and backups will go onto two separate ZFS LUNs. With production, they can do their cloning at night to minimize impact. Not sure what they'll do on test/devl. The two ZFS file systems will probably also be separate zpools (for political reasons, as well as for juggling Hitachi disk space).

BTW, it wasn't the storage guys who decided on the "one filesystem to rule them all" strategy, but my predecessors. It was part of the move from Clarion arrays to Hitachi. The storage folks know about, understand, and agree with us when we talk about these kinds of issues (at least, they do now). We've pushed the caching and other subsystems often enough to make this painfully clear.

> Another thought: while ZFS works out its kinks, why not use
> BCV or ShadowCopy or whatever IBM calls it to create the
> Education instance? This will reduce a tremendous amount of I/O.

This means buying more software to alleviate a short-term problem (with RAC, the whole design will be different, including moving to ASM). We have RMAN and OEM already, so this argument won't fly.

> BTW, I'm curious what application using Oracle is creating
> more than a million files?

Oracle Financials. The application includes everything but the kitchen sink (but the bathroom sink is there!).

Thanks for all of your feedback and suggestions. They all sound bang on. If we could just get all the pieces in place to move forward now, I think we'll be OK. One big issue for us will be finding the Hitachi disk space--we're pretty full-up right now. :-(

Rainer
Rainer Heilke
2007-Jan-18 15:57 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system
> > This problem was fixed in snv_48 last September and will be
> > in S10_U4.

U4 doesn't help us any. We need the fix now. :-( By the time U4 is out, we may even be finished with (or certainly well into) our RAC/ASM migration, and this whole issue will be moot.

Rainer
Rainer Heilke
2007-Jan-18 16:00 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system
Thanks for the detailed explanation of the bug. This makes it clearer to us what's happening, and why (which is something I _always_ appreciate!). Unfortunately, U4 doesn't buy us anything for our current problem.

Rainer
> If you plan on RAC, then ASM makes good sense. It is unclear (to me anyway)
> whether ASM over a zvol is better than ASM over a raw LUN.

Hmm. I thought ASM was really the _only_ effective way to do RAC, but then, I'm not a DBA (and don't want to be ;-) We'll be just using raw LUNs. While the zvol idea is interesting, the DBAs are very particular about making sure the environment is set up in a way Oracle will support (and not hang up on when we have a problem).

Rainer
Rainer,

Have you considered looking for a patch? If you have the supported version(s) of Solaris (which it sounds like you do), this fix may already be available in a patch.

Bev.

Rainer Heilke wrote:
> Thanks for the detailed explanation of the bug. This makes it clearer to us what's happening, and why (which is something I _always_ appreciate!). Unfortunately, U4 doesn't buy us anything for our current problem.
>
> Rainer
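A quick way to check for and apply such a patch once its ID is known (the patch ID below is a placeholder, not the real one for this fix):

  # showrev -p | grep 123456              # is patch 123456 (placeholder) already installed?
  # patchadd /var/spool/patch/123456-01   # apply it from an unpacked patch directory

The actual patch ID would have to come from SunSolve or from the patch announcements George Wilson posted to this list.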
Rainer Heilke wrote:
>> If you plan on RAC, then ASM makes good sense. It is unclear (to me anyway)
>> whether ASM over a zvol is better than ASM over a raw LUN.
>
> Hmm. I thought ASM was really the _only_ effective way to do RAC,
> but then, I'm not a DBA (and don't want to be ;-) We'll be just
> using raw LUNs. While the zvol idea is interesting, the DBAs
> are very particular about making sure the environment is set up
> in a way Oracle will support (and not hang up on when we have a problem).

ASM is relatively new technology. Traditionally, OPS and RAC were built over raw devices, directly or as represented by cluster-aware logical volume managers. DBAs tend not to like raw, so Sun Cluster (Solaris Cluster) supports RAC over QFS, which is a very good solution. Some Sun Cluster customers run RAC over NFS, which also works surprisingly well. Meanwhile, Oracle continues to develop ASM to appease the DBAs who want filesystem-like solutions. IMHO, in the long run Oracle will transition many customers to ASM, and this means that it probably isn't worth the effort to make a file system be the best for Oracle at the expense of other features and workloads.
 -- richard
Rainer Heilke
2007-Jan-18 19:07 UTC
[zfs-discuss] Re: Re: Re: Heavy writes freezing system

Sorry, I should have qualified that "effective" better. I was specifically speaking in terms of Solaris and price. For companies without a SAN (especially those using Linux), something like a NetApp filer using NFS is the way to go, I realize. If you're running Solaris, the cost of QFS becomes a major factor. If you have a SAN, then getting a NetApp filer seems silly. And so on.

Oracle has suggested raw disk for some time, I think. (Some?) DBAs don't seem to like it, largely because they cannot see the files, and so on. ASM still has some of these limitations, but it's getting better, and DBAs are starting to get used to the new paradigms. If I remember a conversation last year correctly, OEM will become the window into some of these ideas.

Once ASM has industry acceptance on a large scale, then yes, making file systems perform well especially for Oracle databases will be chasing the wind. But that may be a while down the road. I don't know, my crystal ball got cracked during the last comet transition. ;-)

Rainer
Sorry, I should have qualified that "effective" better. I was specifically speaking in terms of Solaris and price. For companies without a SAN (especially using Linux), something like a NetApp Filer using UFS is the way to go, I realize. If you''re running Solaris, the cost of QFS becomes a major factor. If you have a SAN, then getting a NetApp Filer seems silly. And so on. Oracle has suggested RAW disk for some time, I think. (Some?) DBA''s don''t seem to like it largely because they cannot see the files, and so on. ASM still has some of these limitations, but it''s getting better, and DBA''s are starting to get used to the new paradigms. If I remember a conversation last year correctly, OEM will become the window into some of these ideas. Once ASM has industry acceptance on a large scale, then yes, making file systems perform well especially for Oracle databases will be chasing the wind. But, that may be a while down the road. I don''t know, my crystal ball got cracked during the last comet transition. ;-) Rainer This message posted from opensolaris.org