We are using MySQL, and love the idea of using zfs for this. We are used to using Direct I/O to bypass file system caching (let the DB do this). Does this exist for zfs?
On Oct 2, 2007, at 1:11 PM, David Runyon wrote:

> We are using MySQL, and love the idea of using zfs for this. We
> are used to using Direct I/O to bypass file system caching (let the
> DB do this). Does this exist for zfs?

Not yet, see:
6429855 Need way to tell ZFS that caching is a lost cause

Is there a specific reason why you need to do the caching at the DB
level instead of the file system? I'm really curious, as I've got
conflicting data on why people do this. If I get more data on real
reasons why we shouldn't cache at the file system, then this could get
bumped up in my priority queue.

eric
1) Modern DBMSs cache database pages in their own buffer pool because
it is less expensive than to access the data from the OS. (IIRC, MySQL's
MyISAM is the only one that relies on the FS cache, but a lot of MySQL
sites use InnoDB, which has its own buffer pool.)

2) Also, direct I/O is faster because it avoids double buffering.

Rayson


On 10/2/07, eric kustarz <eric.kustarz at sun.com> wrote:
> Not yet, see:
> 6429855 Need way to tell ZFS that caching is a lost cause
>
> Is there a specific reason why you need to do the caching at the DB
> level instead of the file system? I'm really curious, as I've got
> conflicting data on why people do this. If I get more data on real
> reasons why we shouldn't cache at the file system, then this could get
> bumped up in my priority queue.
>
> eric
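For what it's worth, this is roughly what the InnoDB side of that looks
like in my.cnf. The numbers are made-up examples, not recommendations:

    [mysqld]
    # InnoDB keeps its own buffer pool, so most of the RAM you want spent
    # on database caching goes here rather than to the FS cache
    innodb_buffer_pool_size = 2048M
    # on filesystems that honor it, this asks InnoDB to open data files
    # with O_DIRECT and skip the OS page cache (no double buffering)
    innodb_flush_method = O_DIRECT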
David Runyon wrote:
> We are using MySQL, and love the idea of using zfs for this. We are
> used to using Direct I/O to bypass file system caching (let the DB do
> this). Does this exist for zfs?

This is a FAQ. See:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Database_Recommendations
http://blogs.sun.com/roch/entry/zfs_and_directio
http://blogs.sun.com/bobs/entry/one_i_o_two_i
 -- richard
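To save a trip to the wiki, the gist of the database recommendations
there is roughly along these lines (pool/dataset names are invented;
treat this as a sketch, not gospel):

    # match the dataset recordsize to the database block size *before*
    # creating the data files, e.g. 8K for a typical OLTP block size
    zfs set recordsize=8k tank/db
    # keep logs/redo on their own dataset so their I/O pattern doesn't
    # fight with the data files
    zfs create tank/dblogs
    # and consider capping the ARC (discussed later in this thread) so it
    # doesn't compete with the database's own buffer cache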
On Tue, Oct 02, 2007 at 01:20:24PM -0600, eric kustarz wrote:
>
> On Oct 2, 2007, at 1:11 PM, David Runyon wrote:
>
> > We are using MySQL, and love the idea of using zfs for this. We
> > are used to using Direct I/O to bypass file system caching (let the
> > DB do this). Does this exist for zfs?
>
> Not yet, see:
> 6429855 Need way to tell ZFS that caching is a lost cause
>
> Is there a specific reason why you need to do the caching at the DB
> level instead of the file system? I'm really curious, as I've got
> conflicting data on why people do this. If I get more data on real
> reasons why we shouldn't cache at the file system, then this could get
> bumped up in my priority queue.

At least two reasons:
http://developers.sun.com/solaris/articles/mysql_perf_tune.html#6
http://blogs.sun.com/glennf/entry/where_do_you_cache_oracle
(the first example proves that this issue is not only Oracle-related)

Regards
przemol
--
http://przemol.blogspot.com/
Rayson Ho writes:

 > 1) Modern DBMSs cache database pages in their own buffer pool because
 > it is less expensive than to access the data from the OS. (IIRC, MySQL's
 > MyISAM is the only one that relies on the FS cache, but a lot of MySQL
 > sites use InnoDB, which has its own buffer pool.)

The DB can and should cache data whether or not directio is used.

 > 2) Also, direct I/O is faster because it avoids double buffering.

A piece of data can be in one buffer, 2 buffers, 3 buffers. That says
nothing about performance. More below.

So I guess you mean DIO is faster because it avoids the extra copy:
DMA straight to the user buffer rather than DMA to a kernel buffer
followed by a copy to the user buffer. If an I/O is 5ms, an 8K copy is
about 10 usec. Is avoiding the copy really the most urgent thing to
work on?

 > On 10/2/07, eric kustarz <eric.kustarz at sun.com> wrote:
 > > Is there a specific reason why you need to do the caching at the DB
 > > level instead of the file system? I'm really curious, as I've got
 > > conflicting data on why people do this. If I get more data on real
 > > reasons why we shouldn't cache at the file system, then this could
 > > get bumped up in my priority queue.

I can't answer this, although I can well imagine that the DB is the
most efficient place to cache its own data, all organised and formatted
to respond to queries. But once the DB has signified to the FS that it
doesn't require the FS to cache data, then the benefit from this RFE is
that the memory used to stage the data can be quickly recycled by ZFS
for subsequent operations. It means the ZFS memory footprint is more
likely to contain useful ZFS metadata and not cached data blocks we
know are not likely to be used again anytime soon. We would also behave
better in mixed DIO/non-DIO workloads.

See also:
http://blogs.sun.com/roch/entry/zfs_and_directio

-r
On 10/3/07, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
> Rayson Ho writes:
>
>  > 1) Modern DBMSs cache database pages in their own buffer pool because
>  > it is less expensive than to access the data from the OS. (IIRC, MySQL's
>  > MyISAM is the only one that relies on the FS cache, but a lot of MySQL
>  > sites use InnoDB, which has its own buffer pool.)
>
> The DB can and should cache data whether or not directio is used.

It does, which leads to the core problem. Why do we have to store the
exact same data twice in memory (i.e., once in the ARC, and once in
the shared memory segment that Oracle uses)?

Due to the lack of direct I/O and kernel asynchronous I/O in ZFS, my
employer has decided to stick with VxFS. I would love nothing more than
to use ZFS with our databases, but unfortunately these missing features
prevent us from doing so. :(

Thanks,
- Ryan
--
UNIX Administrator
http://prefetch.net
Matty writes:

 > On 10/3/07, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
 > > The DB can and should cache data whether or not directio is used.
 >
 > It does, which leads to the core problem. Why do we have to store the
 > exact same data twice in memory (i.e., once in the ARC, and once in
 > the shared memory segment that Oracle uses)?

We do not retain 2 copies of the same data.

If the DB cache is made large enough to consume most of memory,
the ZFS copy will quickly be evicted to stage other I/Os on
their way to the DB cache.

What problem does that pose?

-r
On Wed, Oct 03, 2007 at 10:42:53AM +0200, Roch - PAE wrote:
> Rayson Ho writes:
>  > 2) Also, direct I/O is faster because it avoids double buffering.
>
> A piece of data can be in one buffer, 2 buffers, 3 buffers. That says
> nothing about performance. More below.
>
> So I guess you mean DIO is faster because it avoids the extra copy:
> DMA straight to the user buffer rather than DMA to a kernel buffer
> followed by a copy to the user buffer. If an I/O is 5ms, an 8K copy is
> about 10 usec. Is avoiding the copy really the most urgent thing to
> work on?

If the DB is huge relative to RAM, and very busy, then memory pressure
could become a problem. And it's not just the time spent copying
buffers, but the resources spent managing those copies. (Just guessing.)
On 10/3/07, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
> We do not retain 2 copies of the same data.
>
> If the DB cache is made large enough to consume most of memory,
> the ZFS copy will quickly be evicted to stage other I/Os on
> their way to the DB cache.
>
> What problem does that pose?

Hi Roch,

1) The memory copy operations are expensive... I think the following
is a good intro to this problem:

"Copying data in memory can be a serious bottleneck in DBMS software
today. This fact is often a surprise to database students, who assume
that main-memory operations are "free" compared to disk I/O. But in
practice, a well-tuned database installation is typically not
I/O-bound." (section 3.2)

http://mitpress.mit.edu/books/chapters/0262693143chapm2.pdf
(Ch 2: Anatomy of a Database System, Readings in Database Systems, 4th Ed)

2) If you look at the TPC-C disclosure reports, you will see vendors
using thousands of disks for the top 10 systems. With that many disks
working in parallel, the I/O latencies are not as big of a problem as
on systems with fewer disks.

3) Also interesting is Concurrent I/O, which was introduced in AIX 5.2:

"Improving Database Performance With AIX Concurrent I/O"
http://www-03.ibm.com/systems/p/os/aix/whitepapers/db_perf_aix.html

"Improve database performance on file system containers in IBM DB2 UDB
V8.2 using Concurrent I/O on AIX"
http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0408lee/

Rayson
On Oct 3, 2007, at 10:31 AM, Roch - PAE wrote:

> If the DB cache is made large enough to consume most of memory,
> the ZFS copy will quickly be evicted to stage other I/Os on
> their way to the DB cache.
>
> What problem does that pose?

Personally, I'm still not completely sold on the performance
(performance as in ability, not speed) of ARC eviction. Often times,
especially during a resilver, a server with ~2GB of RAM free under
normal circumstances will dive down to the minfree floor, causing
processes to be swapped out. We've had to take to manually constraining
ARC max size so this situation is avoided. This is on s10u2/3. I
haven't tried anything heavy duty with Nevada simply because I don't
put Nevada in production situations.

Anyhow, in the case of DBs, ARC indeed becomes a vestigial organ. I'm
surprised that this is being met with skepticism considering that
Oracle highly recommends direct IO be used, and, IIRC, Oracle
performance was the main motivation for adding DIO to UFS back in
Solaris 2.6. This isn't a problem with ZFS or any specific fs per se,
it's the buffer caching they all employ. So I'm a big fan of seeing
6429855 come to fruition.

/dale
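For the archives, the manual constraining looks roughly like this. It
assumes bits that have the zfs_arc_max tunable (later S10 updates /
Nevada); on older builds people have poked arc.c_max with mdb instead.
The value is just an example:

    * /etc/system fragment -- cap the ARC at 2 GB so it leaves room for
    * the database's own cache (takes effect at the next reboot)
    set zfs:zfs_arc_max = 0x80000000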
Hey Roch -

> We do not retain 2 copies of the same data.
>
> If the DB cache is made large enough to consume most of memory,
> the ZFS copy will quickly be evicted to stage other I/Os on
> their way to the DB cache.
>
> What problem does that pose?

Can't answer that question empirically, because we can't measure this,
but I imagine there's some overhead to ZFS cache management in evicting
and replacing blocks, and that overhead could be eliminated if ZFS
could be told not to cache the blocks at all.

Now, obviously, whether this overhead would be in the noise level, or
something that actually hurts sustainable performance, will depend on
several things, but I can envision scenarios where it's overhead I'd
rather avoid if I could.

Thanks,
/jim
Rayson Ho wrote:
> On 10/3/07, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
>> We do not retain 2 copies of the same data.
>>
>> If the DB cache is made large enough to consume most of memory,
>> the ZFS copy will quickly be evicted to stage other I/Os on
>> their way to the DB cache.
>>
>> What problem does that pose?
>
> Hi Roch,
>
> 1) The memory copy operations are expensive... I think the following
> is a good intro to this problem:
>
> "Copying data in memory can be a serious bottleneck in DBMS software
> today. This fact is often a surprise to database students, who assume
> that main-memory operations are "free" compared to disk I/O. But in
> practice, a well-tuned database installation is typically not
> I/O-bound." (section 3.2)

... just the ones people are complaining about ;-)  Indeed it seems
rare that a DB performance escalation does not involve I/O tuning :-(

> http://mitpress.mit.edu/books/chapters/0262693143chapm2.pdf
>
> (Ch 2: Anatomy of a Database System, Readings in Database Systems, 4th Ed)
>
> 2) If you look at the TPC-C disclosure reports, you will see vendors
> using thousands of disks for the top 10 systems. With that many disks
> working in parallel, the I/O latencies are not as big of a problem as
> on systems with fewer disks.
>
> 3) Also interesting is Concurrent I/O, which was introduced in AIX 5.2:
>
> "Improving Database Performance With AIX Concurrent I/O"
> http://www-03.ibm.com/systems/p/os/aix/whitepapers/db_perf_aix.html

This is a pretty decent paper, and some of the issues are the same with
UFS. To wit, direct I/O is not always a win (qv. Bob Sneed's blog).
It also describes what we call the single writer lock problem, which
IBM solves with Concurrent I/O. See also:
http://www.solarisinternals.com/wiki/index.php/Direct_I/O

ZFS doesn't have the single writer lock problem. See also:
http://blogs.sun.com/roch/entry/zfs_to_ufs_performance_comparison

Slightly off-topic, in looking at some field data this morning (looking
for something completely unrelated) I noticed that the use of directio
on UFS is declining over time. I'm not sure what that means...
hopefully not more performance escalations...
 -- richard
On Oct 3, 2007, at 5:21 PM, Richard Elling wrote:

> Slightly off-topic, in looking at some field data this morning (looking
> for something completely unrelated) I noticed that the use of directio
> on UFS is declining over time. I'm not sure what that means...
> hopefully not more performance escalations...

Sounds like someone from the ZFS team needs to get with someone from
Oracle/MySQL/Postgres and get the skinny on how the IO rubber->road
boundary should look, because it doesn't sound like there's a
definitive or at least a sure answer here.

Oracle trumpets the use of DIO, and there are benchmarks and first-hand
accounts out there from DBAs on its virtues - at least when running on
UFS (and EXT2/3 on Linux, etc.)

As it relates to ZFS mechanics specifically, there doesn't appear to be
any settled opinion.

/dale
Hi Dale,

We're testing out the enhanced arc_max enforcement (track DNLC entries)
using Build 72 right now. Hopefully it will fix the memory creep, which
is the only real downside to ZFS for DB work, it seems to me.

Frankly, our DB loads have improved performance with ZFS. I suspect
it's because we are write-heavy.

-J

On 10/3/07, Dale Ghent <daleg at elemental.org> wrote:
> Personally, I'm still not completely sold on the performance
> (performance as in ability, not speed) of ARC eviction. [...]
> We've had to take to manually constraining ARC max size so this
> situation is avoided. This is on s10u2/3.
Postgres assumes that the OS takes care of caching:

"PLEASE NOTE. PostgreSQL counts a lot on the OS to cache data files and
hence does not bother with duplicating its file caching effort. The
shared buffers parameter assumes that OS is going to cache a lot of
files and hence it is generally very low compared with system RAM. Even
for a dataset in excess of 20GB, a setting of 128MB may be too much, if
you have only 1GB RAM and an aggressive-at-caching OS like Linux."

Tuning PostgreSQL for performance, Shridhar Daithankar, Josh Berkus,
2003, http://www.varlena.com/GeneralBits/Tidbits/perf.html

Slightly off-topic, I have noticed at least a 25% performance gain on
my PostgreSQL database after installing Wu Fengguang's adaptive
read-ahead disk cache patch for the Linux kernel.
http://lkml.org/lkml/2005/9/15/185
http://www.samag.com/documents/s=10101/sam0616a/0616a.htm

I was wondering if Solaris uses a similar approach.

On 04/10/2007, at 4:44 AM, Dale Ghent wrote:
> Sounds like someone from the ZFS team needs to get with someone from
> Oracle/MySQL/Postgres and get the skinny on how the IO rubber->road
> boundary should look, because it doesn't sound like there's a
> definitive or at least a sure answer here.
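In concrete terms, the approach that article describes ends up looking
something like this in postgresql.conf, with the OS (or the ARC, here)
expected to do the heavy lifting. Numbers are purely illustrative, and
newer releases accept memory units like these while older ones take the
same settings as a count of 8K pages:

    shared_buffers = 128MB        # deliberately small private cache
    effective_cache_size = 4GB    # planner hint: how much the OS is
                                  # caching on PostgreSQL's behalf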
> Anyhow, in the case of DBs, ARC indeed becomes a vestigial organ. I'm
> surprised that this is being met with skepticism considering that
> Oracle highly recommends direct IO be used, and, IIRC, Oracle
> performance was the main motivation for adding DIO to UFS back in
> Solaris 2.6. This isn't a problem with ZFS or any specific fs per se,
> it's the buffer caching they all employ. So I'm a big fan of seeing
> 6429855 come to fruition.

The point is that directI/O typically means two things:
1) concurrent I/O
2) no caching at the file system

Most file systems (ufs, vxfs, etc.) don't do 1) or 2) without turning
on "directI/O".

ZFS *does* 1). It doesn't do 2) (currently).

That is what we're trying to discuss here.

Where does the win come from with "directI/O"? Is it 1), 2), or some
combination? If it's a combination, what's the percentage of each
towards the win?

We need to tease 1) and 2) apart to have a full understanding. I'm not
against adding 2) to ZFS but want more information. I suppose I'll just
prototype it and find out for myself.

eric
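Concretely, on UFS both behaviours come bundled behind one switch,
which is part of why the term is so overloaded (the device path below
is just an example):

    # all-or-nothing: concurrent I/O *and* no FS caching, for the whole mount
    mount -F ufs -o forcedirectio /dev/dsk/c0t0d0s6 /oracle/data
    # (or per-file from the application via directio(3C))

ZFS giving you 1) unconditionally is exactly why it would be nice to be
able to measure 2) on its own.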
On Oct 3, 2007, at 3:44 PM, Dale Ghent wrote:

> Sounds like someone from the ZFS team needs to get with someone from
> Oracle/MySQL/Postgres and get the skinny on how the IO rubber->road
> boundary should look, because it doesn't sound like there's a
> definitive or at least a sure answer here.

I've done that already (Oracle, Postgres, JavaDB, etc.). Because the
holy grail of "directI/O" is an overloaded term, we don't really know
where the win within "directI/O" lies. In any event, it seems the only
way to get a definitive answer here is to prototype a no-caching
property...

eric
Would it be easier to...

1) Change the ZFS code to enable a sort of directIO emulation and then
run various tests... or

2) Use Sun's performance team, which has all the experience in the
world when it comes to performing benchmarks on Solaris and Oracle,
plus a DTrace master to drill down and see what the difference is
between UFS and UFS/DIO... and where the real win lies.

On 10/4/07, eric kustarz <eric.kustarz at sun.com> wrote:
> In any event, it seems the only way to get a definitive answer here is
> to prototype a no-caching property...
> Where does the win come from with "directI/O"? Is it 1), 2), or some
> combination? If it's a combination, what's the percentage of each
> towards the win?

That will vary based on workload (I know, you already knew that ... :^).
Decomposing the performance win between what is gained as a result of
single writer lock breakup and no caching is something we can only
guess at, because, at least for UFS, you can't do just one - it's all
or nothing.

> We need to tease 1) and 2) apart to have a full understanding.

We can't. We can only guess (for UFS).

My opinion - it's a must-have for ZFS if we're going to get serious
attention in the database space. I'll bet dollars-to-donuts that, over
the next several years, we'll burn many tens-of-millions of dollars on
customer support escalations that come down to memory utilization
issues and contention between database-specific buffering and the ARC.
This is entirely my opinion (not that of Sun), and I've been wrong
before.

Thanks,
/jim
Jim Mauro writes:

 > That will vary based on workload (I know, you already knew that ... :^).
 > Decomposing the performance win between what is gained as a result of
 > single writer lock breakup and no caching is something we can only
 > guess at, because, at least for UFS, you can't do just one - it's all
 > or nothing.
 >
 > My opinion - it's a must-have for ZFS if we're going to get serious
 > attention in the database space. I'll bet dollars-to-donuts that, over
 > the next several years, we'll burn many tens-of-millions of dollars on
 > customer support escalations that come down to memory utilization
 > issues and contention between database-specific buffering and the ARC.
 > This is entirely my opinion (not that of Sun),

...memory utilisation... OK, so we should implement the 'lost cause' RFE.

In all cases, ZFS must not steal pages from other memory consumers:

6488341 ZFS should avoid growing the ARC into trouble

So the DB memory pages should not be _contended_ for.

-r

 > and I've been wrong before.
 >
 > Thanks,
 > /jim
eric kustarz writes:

 > The point is that directI/O typically means two things:
 > 1) concurrent I/O
 > 2) no caching at the file system

In my blog I also mention:

3) no readahead (but that can be viewed as an implicit consequence of 2)

And someone chimed in with

4) the ability to do I/O at sector granularity.

I also think that for many, 2) is too weak a form of what they expect:

5) DMA straight from the user buffer to disk, avoiding a copy.

So,

1) Concurrent I/O we have in ZFS.

2) No caching. We could do this by taking a directio hint and evicting
the ARC buffer immediately after the copyout to user space for reads,
and after txg completion for writes.

3) No prefetching. We have 2 levels of prefetching. The low level was
fixed recently and should not cause problems for DB loads. The high
level still needs fixing on its own. Then we should take the same hint
as 2) to disable it altogether. In the meantime we can tune our way
into this mode.

4) Sector-sized I/O is really foreign to the ZFS design.

5) Zero copy & more CPU efficiency. This, I think, is where the debate
is.

My line has been that 5) won't help latency much, and latency is where
I think the game is currently played. Now the disconnect might be
because people feel that the game is not latency but CPU efficiency:
"how many CPU cycles do I burn to get data from disk to the user
buffer". This is a valid point. Configurations with a very large number
of disks can end up saturated by the filesystem's CPU utilisation.

So I still think that the major areas for ZFS perf gains are on the
latency front: block allocation (now much improved with the separate
intent log), I/O scheduling, and other fixes to the threading & ARC
behavior. But at some point we can turn our microscope on the CPU
efficiency of the implementation. The copy will certainly be a big
chunk of the CPU cost per I/O, but I would still like to gather that
data.

Also consider: 50 disks at 200 IOPS of 8K is 80 MB/sec. That means
maybe 1/10th of a single CPU to be saved by avoiding just the copy.
Probably not what people have in mind. How many CPUs do you have when
attaching 1000 drives to a host running a 100TB database? That many
drives will barely occupy 2 cores running the copies.

People want performance and efficiency. Directio is just an overloaded
name that delivered those gains to other filesystems.

Right now, what I think is worth gathering is the cycles spent in ZFS
per read & write in a large DB environment where the DB holds 90% of
memory. For comparison with another FS, we should disable checksums,
file prefetching and vdev prefetching, cap the ARC, set atime off, and
use an 8K recordsize. A breakdown and comparison of the CPU cost per
layer will be quite interesting and point to what needs work.

Another interesting thing for me would be: what is your budget? "How
many cycles per DB read and write are you willing to spend, and how did
you come to that number?"

But, as Eric says, let's develop 2) and I'll try in parallel to figure
out the per-layer breakdown cost.

-r
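For anyone wanting to reproduce that comparison setup, it would look
roughly like this (the dataset name is invented, and the /etc/system
tunables are Nevada-era knobs whose names may differ by release):

    zfs set recordsize=8k tank/db     # match the DB block size
    zfs set atime=off tank/db
    zfs set checksum=off tank/db      # only for the apples-to-apples CPU test

    * /etc/system: cap the ARC and disable file- and vdev-level prefetch
    set zfs:zfs_arc_max = 0x40000000
    set zfs:zfs_prefetch_disable = 1
    set zfs:zfs_vdev_cache_size = 0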
On Wed, Oct 03, 2007 at 04:31:01PM +0200, Roch - PAE wrote:
> > It does, which leads to the core problem. Why do we have to store the
> > exact same data twice in memory (i.e., once in the ARC, and once in
> > the shared memory segment that Oracle uses)?
>
> We do not retain 2 copies of the same data.
>
> If the DB cache is made large enough to consume most of memory,
> the ZFS copy will quickly be evicted to stage other I/Os on
> their way to the DB cache.
>
> What problem does that pose?

Other things deserving of staying in the cache get pushed out by things
that don't deserve being in the cache. Thus systemic memory pressure
(e.g., more on-demand paging of text).

Nico
On Thu, Oct 04, 2007 at 03:49:12PM +0200, Roch - PAE wrote:
> ...memory utilisation... OK, so we should implement the 'lost cause' RFE.
>
> In all cases, ZFS must not steal pages from other memory consumers:
>
> 6488341 ZFS should avoid growing the ARC into trouble
>
> So the DB memory pages should not be _contended_ for.

What if your executable text, and pretty much everything, lives on ZFS?
You don't want to contend for the memory caching those things either.
It's not just the DB's memory you don't want to contend for.
Nicolas Williams writes:

 > On Wed, Oct 03, 2007 at 04:31:01PM +0200, Roch - PAE wrote:
 > > We do not retain 2 copies of the same data.
 > >
 > > If the DB cache is made large enough to consume most of memory,
 > > the ZFS copy will quickly be evicted to stage other I/Os on
 > > their way to the DB cache.
 > >
 > > What problem does that pose?
 >
 > Other things deserving of staying in the cache get pushed out by things
 > that don't deserve being in the cache. Thus systemic memory pressure
 > (e.g., more on-demand paging of text).

I agree. That's why I submitted both of these:

6429855 Need way to tell ZFS that caching is a lost cause
6488341 ZFS should avoid growing the ARC into trouble

-r
Nicolas Williams writes:

 > On Thu, Oct 04, 2007 at 03:49:12PM +0200, Roch - PAE wrote:
 > > In all cases, ZFS must not steal pages from other memory consumers:
 > >
 > > 6488341 ZFS should avoid growing the ARC into trouble
 > >
 > > So the DB memory pages should not be _contended_ for.
 >
 > What if your executable text, and pretty much everything, lives on ZFS?
 > You don't want to contend for the memory caching those things either.
 > It's not just the DB's memory you don't want to contend for.

On the read side, we're talking here about 1000 disks each running 35
concurrent I/Os of 8K, so a footprint of 250MB, to stage a ton of work.

On the write side we do have to play with the transaction group, so
that will be 5-10 seconds worth of synchronous write activity.

But how much memory does a 1000-disk server have?

-r
On Thu, Oct 04, 2007 at 06:59:56PM +0200, Roch - PAE wrote:
> Nicolas Williams writes:
>  > What if your executable text, and pretty much everything, lives on ZFS?
>  > You don't want to contend for the memory caching those things either.
>  > It's not just the DB's memory you don't want to contend for.
>
> On the read side, we're talking here about 1000 disks each running 35
> concurrent I/Os of 8K, so a footprint of 250MB, to stage a ton of work.

I'm not sure what you mean, but extra copies and memory just to stage
the I/Os is not the same as the systemic memory pressure issue.

Now, I'm _speculating_ as to what the real problem is, but it seems
very likely that putting things in the cache that needn't be there
would push out things that should be there, and since restoring those
things to the cache later would cost I/Os...
I'd like to second a couple of comments made recently:

* If they don't regularly do so, I too encourage the ZFS, Solaris
performance, and Sun Oracle support teams to sit down and talk about
the utility of Direct I/O for databases.

* I too suspect that absent Direct I/O (or some ringing endorsement
from Oracle about how ZFS doesn't need Direct I/O), there will be a
drain of customer escalations regarding the lack -- plus FUD and other
sales inhibitors.

While I realize that Sun has not published a TPC-C result since 2001
and offers a different value proposition to customers, performance does
matter, and in some cases Direct I/O can contribute to that.

Historically, every TPC-C database benchmark run can be converted from
being I/O bound to being CPU bound by adding enough disk spindles and
enough main memory. In that context, saving the CPU cycles (and cache
misses) from a copy is important.

Another historical trend was that for performance, portability across
different operating systems, and perhaps just because they could,
databases tended to use as few OS capabilities as possible and to do
their own resource management. So for instance databases were often
benchmarked using raw devices. Customers on the other hand preferred
the manageability of filesystems and tended to deploy there. In that
context, Direct I/O is an attempt to get the best of both worlds.

Finally, besides UFS Direct I/O on Solaris, other filesystems including
VxFS also have various forms of Direct I/O -- either separate APIs or
mount options that bypass the cache on large writes, etc. Understanding
those benefits, both real and advertised, helps understand the
opportunities and shortfalls for ZFS.

It may be that this is not the most important thing for ZFS performance
or capability right now -- measurement in targeted configurations and
workloads is the only way to tell -- but I'd be highly surprised if
there isn't something (bypass cache on really large writes?) that can
be learned from experiences with Direct I/O.

Eric (Hamilton)
> 5) DMA straight from the user buffer to disk, avoiding a copy.

This is what the "direct" in "direct i/o" has historically meant. :-)

> My line has been that 5) won't help latency much, and latency is where
> I think the game is currently played. Now the disconnect might be
> because people feel that the game is not latency but CPU efficiency:
> "how many CPU cycles do I burn to get data from disk to the user
> buffer".

Actually, it's less CPU cycles in many cases than memory cycles.

For many databases, most of the I/O is writes (reads wind up cached in
memory). What's the cost of a write?

With direct I/O: the CPU writes to memory (spread out over many
transactions), the disk DMAs from memory. We write LPS (log page size)
bytes of data from CPU to memory, we read LPS bytes from memory. On
processors without a cache line zero, we probably read the LPS data
from memory as part of the write. Total cost = W:LPS, R:2*LPS.

Without direct I/O: the cost of getting the data into the user buffer
remains the same (W:LPS, R:LPS). We copy the data from user buffer to
system buffer (W:LPS, R:LPS). Then we push it out to disk. Total cost =
W:2*LPS, R:3*LPS. We've nearly doubled the cost, not including any TLB
effects.

On a memory-bandwidth-starved system (which should be nearly all modern
designs, especially with multi-threaded chips like Niagara), replacing
buffered I/O with direct I/O should give you nearly a 2x improvement in
log write bandwidth. That's without considering cache effects (which
shouldn't be too significant, really, since LPS should be << the size
of L2).

How significant is this? We'd have to measure; and it will likely vary
quite a lot depending on which database is used for testing.

But note that, for ZFS, the win with direct I/O will be somewhat less.
That's because you still need to read the page to compute its checksum.
So for direct I/O with ZFS (with checksums enabled), the cost is W:LPS,
R:2*LPS. Is saving one page of writes enough to make a difference?
Possibly not.

Anton
I've been thinking about this for a while, but Anton's analysis makes
me think about it even more:

We all love ZFS, right? It's futuristic in a bold new way, with many
virtues; I won't preach to the choir. But to make it all glue together
takes some necessary CPU/memory-intensive operations around checksum
generation/validation, compression, encryption, data
placement/component load balancing, etc. Processors have gotten really
powerful, much more so than the relative disk I/O gains, which in all
honesty makes ZFS possible. My question: is anyone working on an
offload engine for ZFS?

I can envision a highly optimized, pipelined system, where writes and
reads pass through checksum, compression, encryption ASICs, that also
locate data properly on disk. This could even be in the form of a PCIe
SATA/SAS card with many ports, or different options. This would make
direct IO, or DMA IO, possible again.

The file system abstraction with ZFS is really too much and too
important to ignore, and too hard to optimize with different load
conditions (my rookie opinion) to expect any RDBMS app to have a clue
what to do with it. I guess what I'm saying is that the RDBMS app will
know what blocks it needs, and wants to get them in and out speedy
quick, but the mapping to disk is not linear with ZFS, the way it is
with other file systems. An offload engine could translate this
instead.

Just throwing this out there for the purpose of blue sky fluff.

Jon

Anton B. Rang wrote:
> But note that, for ZFS, the win with direct I/O will be somewhat less.
> That's because you still need to read the page to compute its
> checksum. So for direct I/O with ZFS (with checksums enabled), the
> cost is W:LPS, R:2*LPS. Is saving one page of writes enough to make a
> difference? Possibly not.

--
Jonathan Loran  -  IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146   jloran at ssl.berkeley.edu
Hello Rayson,

Tuesday, October 2, 2007, 8:56:09 PM, you wrote:

RH> 1) Modern DBMSs cache database pages in their own buffer pool because
RH> it is less expensive than to access data from the OS. (IIRC, MySQL's
RH> MyISAM is the only one that relies on the FS cache, but a lot of MySQL
RH> sites use InnoDB which has its own buffer pool)

RH> 2) Also, direct I/O is faster because it avoids double buffering.

I doubt it's buying you much... However, on UFS, if you go with direct
IO you allow concurrent writes to the same file and you disable
read-aheads - I guess that's buying you much more in most cases than
eliminating double buffering.

Now the question is - if an application is using the directio() call,
what happens if the underlying fs is zfs?

--
Best regards,
Robert Milkowski                  mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
From: "Anton B. Rang" <rang at acm.org>> For many databases, most of the I/O is writes (reads wind up > cached in memory).2 words: table scan -=dave
On 5-Oct-07, at 2:26 AM, Jonathan Loran wrote:

> I've been thinking about this for a while, but Anton's analysis makes
> me think about it even more:
>
> We all love ZFS, right? It's futuristic in a bold new way, with many
> virtues; I won't preach to the choir. But to make it all glue together
> takes some necessary CPU/memory-intensive operations around checksum
> generation/validation, compression, encryption, data
> placement/component load balancing, etc. Processors have gotten really
> powerful, much more so than the relative disk I/O gains, which in all
> honesty makes ZFS possible. My question: is anyone working on an
> offload engine for ZFS?

How far would that compromise ZFS' #1 virtue (IMHO), end to end
integrity?

--Toby

> I can envision a highly optimized, pipelined system, where writes and
> reads pass through checksum, compression, encryption ASICs, that also
> locate data properly on disk. This could even be in the form of a PCIe
> SATA/SAS card with many ports, or different options.
On 10/5/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> RH> 2) Also, direct I/O is faster because it avoids double buffering.
>
> I doubt it's buying you much...

We don't know how much the performance gain is until we get a prototype
and benchmark it - the behavior is different with different DBMSs,
OSes, workloads (i.e. I/O rate, hit ratio), etc.

> However, on UFS, if you go with direct IO you allow concurrent writes
> to the same file and you disable read-aheads - I guess that's buying
> you much more in most cases than eliminating double buffering.

I really hope that someone can sit down and look at the database
interfaces provided by all the filesystems. So far, there are Direct
I/O, Concurrent I/O (AIX JFS2), and Quick I/O (VxFS):

http://eval.veritas.com/webfiles/docs/qiowp.pdf

Then a prototype for ZFS will help us understand how much we can get...

Rayson

> Now the question is - if an application is using the directio() call,
> what happens if the underlying fs is zfs?
Toby Thain wrote:
> On 5-Oct-07, at 2:26 AM, Jonathan Loran wrote:
>
>> My question: is anyone working on an offload engine for ZFS?
>
> How far would that compromise ZFS' #1 virtue (IMHO), end to end
> integrity?

It need not; in fact with ZFS Crypto you will already get the
encryption and checksum offloaded if you have suitable hardware (e.g. a
SCA-6000 card or a Niagara 2 processor).

--
Darren J Moffat
On Fri, 2007-10-05 at 09:40 -0300, Toby Thain wrote:
> How far would that compromise ZFS' #1 virtue (IMHO), end to end
> integrity?

Speed sells, and speed kills. If the offload were done on the HBA, it
would extend the size of the "assumed correct" part of the hardware
from just the CPU+memory to also include the offload device and all the
I/O bridges, DMA widgets, etc., between it and memory.

This is already the case for the TCP checksum offloads commonly
performed in network cards, and, yes, people get bitten by it
occasionally. An analysis of packets with bad tcp checksums I'm
familiar with showed a fair number which showed patterns consistent
with a DMA hiccup (repeated words or cache lines); those glitches on a
system with checksum offload would turn into data corruption.

See "When The CRC and TCP Checksum Disagree",
http://www.sigcomm.org/sigcomm2000/conf/paper/sigcomm2000-9-1.pdf

 - Bill
On Thu, Oct 04, 2007 at 10:26:24PM -0700, Jonathan Loran wrote:
> I can envision a highly optimized, pipelined system, where writes and
> reads pass through checksum, compression, encryption ASICs, that also
> locate data properly on disk. ...

I've argued before that RAID-Z could be implemented in hardware. But I
think that it's all about economics. Software is easier to develop and
patch than hardware, so if we can put together systems with enough
memory, general purpose CPU horsepower, and memory and I/O bandwidth,
all cheaply enough, then that will be better than developing special
purpose hardware for ZFS. Thumper is an example of such a system.

Eventually we may find trends in system design once again favoring
pushing special tasks to the edge. When that happens I'm sure we'll go
there. But right now the trend is to put crypto co-processors and NICs
on the same die as the CPU.

Nico
Nicolas Williams wrote:
> I've argued before that RAID-Z could be implemented in hardware. But I
> think that it's all about economics. Software is easier to develop and
> patch than hardware, so if we can put together systems with enough
> memory, general purpose CPU horsepower, and memory and I/O bandwidth,
> all cheaply enough, then that will be better than developing special
> purpose hardware for ZFS. Thumper is an example of such a system.
>
> Eventually we may find trends in system design once again favoring
> pushing special tasks to the edge. When that happens I'm sure we'll go
> there. But right now the trend is to put crypto co-processors and NICs
> on the same die as the CPU.

Time for on board FPGAs!
On Fri, Oct 05, 2007 at 08:56:26AM -0700, Tim Spriggs wrote:
> Time for on board FPGAs!

Heh!
Nicolas Williams wrote:
> I've argued before that RAID-Z could be implemented in hardware. But I
> think that it's all about economics. Software is easier to develop and
> patch than hardware, so if we can put together systems with enough
> memory, general purpose CPU horsepower, and memory and I/O bandwidth,
> all cheaply enough, then that will be better than developing special
> purpose hardware for ZFS. Thumper is an example of such a system.
>
> Eventually we may find trends in system design once again favoring
> pushing special tasks to the edge. When that happens I'm sure we'll go
> there. But right now the trend is to put crypto co-processors and NICs
> on the same die as the CPU.

1) We can put it on the same die also, or at least as a chip set on the
MoBo.

2) Offload engines do have software, stored in firmware. Or maybe such
an offload processor could run software out of a driver, loaded
dynamically?

3) You are all aware of how many microprocessors are involved in a
normal file server, right? There's one at almost every interface: disk
to controller, controller to PCI bridge, PCI bridge to Hyperbus, etc.
Imagine the burden if you did all that in the CPU only. I sometimes
find it amazing computers are as stable as they are, but it's all in
the maturity of the code running at every step of the way, and of
course, good firmware coding practices. Your vanilla SCSI controllers
and disk drives do a lot of very complex but useful processing. We
trust these guys 100%, because the interface is stable, and the code
and processors are mature and well used.

I do agree, pushing ZFS to the edge will come down the road, when it
becomes less dynamic (how boring) and we know more about the
bottlenecks.

Jon

--
Jonathan Loran  -  IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146   jloran at ssl.berkeley.edu
> Is there a specific reason why you need to do the caching at the DB
> level instead of the file system? I'm really curious, as I've got
> conflicting data on why people do this. If I get more data on real
> reasons why we shouldn't cache at the file system, then this could get
> bumped up in my priority queue.

FWIW a MySQL database was recently moved to a FreeBSD system with ZFS.
Performance ended up sucking because for some reason data did not make
it into the cache in a predictable fashion (a simple case of repeated
queries was not cached; so for example a very common query, even when
executed repeatedly on an idle system, would take more than 1 minute
instead of 0.10 seconds or so when cached).

Ended up convincing the person running the DB to switch from MyISAM
(which does not seem to support DB-level caching, other than of
indexes) to InnoDB, thus allowing use of the InnoDB buffer cache.

I don't know why it wasn't cached by ZFS/ARC to begin with (the size of
the ARC cache was definitely large enough - ~800 MB, and I know the
working set for this query was below 300 MB). Perhaps it has to do with
the ARC trying to be smart and avoiding flushing the cache with useless
data? I am not read up on the details of the ARC. But in this
particular case it was clear that a simple LRU would have been much
more useful - unless there was some other problem related to my setup
or the FreeBSD integration that somehow broke proper caching.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
> But note that, for ZFS, the win with direct I/O will be somewhat less.
> That's because you still need to read the page to compute its
> checksum. So for direct I/O with ZFS (with checksums enabled), the
> cost is W:LPS, R:2*LPS. Is saving one page of writes enough to make a
> difference? Possibly not.

It's more complicated than that. The kernel would be verifying
checksums on buffers in a user's address space. For this to work, we
have to map these buffers into the kernel and simultaneously arrange
for these pages to be protected from other threads in the user's
address space.

We discussed some of the VM gymnastics required to properly implement
this back in January:

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-January/thread.html#36890

-j
Peter Schuller wrote:
> I don't know why it wasn't cached by ZFS/ARC to begin with (the size of
> the ARC cache was definitely large enough - ~800 MB, and I know the
> working set for this query was below 300 MB). Perhaps it has to do with
> the ARC trying to be smart and avoiding flushing the cache with useless
> data? I am not read up on the details of the ARC. But in this
> particular case it was clear that a simple LRU would have been much
> more useful - unless there was some other problem related to my setup
> or the FreeBSD integration that somehow broke proper caching.

Neel's arcstat might help shed light on such behaviour.
http://blogs.sun.com/realneel/entry/zfs_arc_statistics
 -- richard
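If fetching the script is a hassle, the raw counters it reads are
available directly; something along these lines should show whether the
ARC is actually growing and hitting (the FreeBSD sysctl names below are
my best guess for that port):

    # Solaris: ARC size, target and hit/miss counters
    kstat -m zfs -n arcstats | egrep 'size|hits|misses'
    # FreeBSD exports the same counters as sysctls, e.g.
    sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.misses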
Hi All,

While pumping IO on a ZFS file system my system is crashing/panicking. Please find the crash dump below.

panic[cpu0]/thread=2a100adfcc0: assertion failed: ss != NULL, file: ../../common/fs/zfs/space_map.c, line: 125

000002a100adec40 genunix:assfail+74 (7b652448, 7b652458, 7d, 183d800, 11ed400, 0)
  %l0-3: 0000000000000000 0000000000000000 00000000011e7508 000003000744ea30
  %l4-7: 00000000011ed400 0000000000000000 000000000186fc00 0000000000000000
000002a100adecf0 zfs:space_map_remove+b8 (3000683e7b8, 2b200000, 20000, 7b652000, 7b652400, 7b652400)
  %l0-3: 0000000000000000 000000002b220000 000000002b0ec600 000003000744ebc0
  %l4-7: 000003000744eaf8 000000002b0ec000 000000007b652000 000000002b0ec600
000002a100adedd0 zfs:space_map_load+218 (3000683e7b8, 30006f5f160, 1000, 3000683e488, 2b000000, 1)
  %l0-3: 0000000000000160 0000030006f5f000 0000000000000000 000000007b620ad0
  %l4-7: 000000007b62086c 00007fffffffffff 0000000000007fff 0000030006f5f128
000002a100adeea0 zfs:metaslab_activate+3c (3000683e480, 8000000000000000, c000000000000000, 24a998, 3000683e480, c0000000)
  %l0-3: 0000000000000000 0000000000000008 0000000000000000 0000029ebf9d0000
  %l4-7: 00000000704e2000 000003000391e940 0000030005572540 00000300060bacd0
000002a100adef50 zfs:metaslab_group_alloc+1bc (3fffffffffffffff, 20000, 8000000000000000, 7e68000, 30006766080, ffffffffffffffff)
  %l0-3: 0000000000000000 00000300060bacd8 0000000000000001 000003000683e480
  %l4-7: 8000000000000000 0000000000000000 0000000003f34000 4000000000000000
000002a100adf030 zfs:metaslab_alloc_dva+114 (0, 7e68000, 30006766080, 20000, 30005572540, 1e910)
  %l0-3: 0000000000000001 0000000000000000 0000000000000003 000003000380b6e0
  %l4-7: 0000000000000000 00000300060bacd0 0000000000000000 00000300060bacd0
000002a100adf100 zfs:metaslab_alloc+2c (3000391e940, 20000, 30006766080, 1, 1e910, 0)
  %l0-3: 0000009980001605 0000000000000016 0000000000001b4d 0000000000000214
  %l4-7: 0000000000000000 0000000000000000 000003000391e940 0000000000000001
000002a100adf1b0 zfs:zio_dva_allocate+4c (30005dd8a40, 7b6335a8, 30006766080, 704e2508, 704e2400, 20001)
  %l0-3: 0000030005dd8a40 0000060200ff00ff 0000060200ff00ff 0000000000000000
  %l4-7: 0000000000000000 00000000018a6400 0000000000000001 0000000000000006
000002a100adf260 zfs:zio_write_compress+1ec (30005dd8a40, 23e20b, 23e000, ff00ff, 2, 30006766080)
  %l0-3: 000000000000ffff 00000000000000ff 0000000000000100 0000000000020000
  %l4-7: 0000000000000000 0000000000ff0000 000000000000fc00 00000000000000ff
000002a100adf330 zfs:arc_write+e4 (30005dd8a40, 3000391e940, 6, 2, 1, 1e910)
  %l0-3: ffffffffffffffff 000000007b6063c8 0000030006af2570 00000300060c5cf0
  %l4-7: 000002a100adf538 0000000000000004 0000000000000004 00000300060c7a88
000002a100adf440 zfs:dbuf_sync+6c0 (30006af2570, 30005dd9440, 2b3ca, 2, 6, 1e910)
  %l0-3: 0000030005dd96c0 0000000000000000 0000030006ae7750 0000030006af2678
  %l4-7: 0000030006766080 0000000000000013 0000000000000001 0000000000000000
000002a100adf560 zfs:dnode_sync+35c (0, 0, 30005dd9440, 30005ac8cc0, 2, 2)
  %l0-3: 0000030006af2570 0000030006ae77a8 0000030006ae7808 0000030006ae7808
  %l4-7: 0000000000000000 0000030006ae77a8 0000000000000001 000003000640ace0
000002a100adf620 zfs:dmu_objset_sync_dnodes+6c (30005dd96c0, 30005dd97a0, 30005ac8cc0, 30006ae7750, 30006bd3ca0, 0)
  %l0-3: 00000000704e84c0 00000000704e8000 00000000704e8000 0000000000000001
  %l4-7: 0000000000000000 00000000704e4000 0000000000000000 0000030005dd9440
000002a100adf6d0 zfs:dmu_objset_sync+54 (30005dd96c0, 30005ac8cc0, 0, 0, 300060c5318, 1e910)
  %l0-3: 0000000000000000 000000000000000f 0000000000000000 000000000000478d
  %l4-7: 0000030005dd97a0 0000000000000000 0000030005dd97a0 0000030005dd9820
000002a100adf7e0 zfs:dsl_dataset_sync+c (30006f36780, 30005ac8cc0, 30006f36810, 300040c7db8, 300040c7db8, 30006f36780)
  %l0-3: 0000000000000001 0000000000000007 00000300040c7e38 0000000000000000
  %l4-7: 0000030006f36808 0000000000000000 0000000000000000 0000000000000000
000002a100adf890 zfs:dsl_pool_sync+64 (300040c7d00, 1e910, 30006f36780, 30005ac9640, 30005581a80, 30005581aa8)
  %l0-3: 0000000000000000 000003000391ed00 0000030005ac8cc0 00000300040c7e98
  %l4-7: 00000300040c7e68 00000300040c7e38 00000300040c7da8 0000030005dd9440
000002a100adf940 zfs:spa_sync+1b0 (3000391e940, 1e910, 0, 0, 2a100adfcc4, 1)
  %l0-3: 000003000391eb00 000003000391eb10 000003000391ea28 0000030005ac9640
  %l4-7: 0000000000000000 000003000410f580 00000300040c7d00 000003000391eac0
000002a100adfa00 zfs:txg_sync_thread+134 (300040c7d00, 1e910, 0, 2a100adfab0, 300040c7e10, 300040c7e12)
  %l0-3: 00000300040c7e20 00000300040c7dd0 0000000000000000 00000300040c7dd8
  %l4-7: 00000300040c7e16 00000300040c7e14 00000300040c7dc8 000000000001e911

syncing file systems... [1] 16 [1] 6 [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] done (not all i/o completed)
dumping to /dev/dsk/c1t0d0s1, offset 429916160, content: kernel

Unfortunately I don't have much experience with crash dump analysis. Can anyone explain why my machine went down?

-Masthan D
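As a starting point for the crash dump analysis mentioned above, the usual first steps on Solaris look something like the sketch below (paths assume savecore(1M) wrote the dump into the default /var/crash/<hostname> directory; dump number 0 is just an example):

    cd /var/crash/`hostname`
    mdb unix.0 vmcore.0

    ::status        # panic string, OS release, dump contents
    ::msgbuf        # console messages leading up to the panic
    $c              # stack trace of the panicking thread

Even just ::status and $c are usually enough to match a panic against an existing bug report.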
Hi All,

Has anyone had a chance to look into this issue?

-Masthan D

dudekula mastan <d_mastan at yahoo.com> wrote:
> While pumping IO on a ZFS file system my system is crashing/panicking.
>
> panic[cpu0]/thread=2a100adfcc0: assertion failed: ss != NULL,
> file: ../../common/fs/zfs/space_map.c, line: 125
> [...]
> Unfortunately I don't have much experience with crash dump analysis.
> Can anyone explain why my machine went down?
Prabahar Jeyaram
2007-Oct-09 06:07 UTC
[zfs-discuss] ZFS file system is crashing my system
Your system seems to have hit a variant of this bug:

6458218 - http://bugs.opensolaris.org/view_bug.do?bug_id=6458218

This is fixed in OpenSolaris build 60 or S10U4.

--
Prabahar.

On Oct 8, 2007, at 10:04 PM, dudekula mastan wrote:

> Hi All,
>
> Has anyone had a chance to look into this issue?
>
> -Masthan D
>
> dudekula mastan <d_mastan at yahoo.com> wrote:
>
> While pumping IO on a ZFS file system my system is crashing/panicking.
>
> panic[cpu0]/thread=2a100adfcc0: assertion failed: ss != NULL,
> file: ../../common/fs/zfs/space_map.c, line: 125
> [...]
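If it helps, a quick way to check whether a machine already carries that fix is to look at the release and kernel build strings. A rough sketch (the strings to look for are my reading of the bug report: snv_60 or later for OpenSolaris/Nevada, or, if I remember the naming right, the Solaris 10 8/07 release for S10U4):

    cat /etc/release     # look for snv_60 or later, or the Solaris 10 8/07 (U4) release
    uname -v             # kernel build string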
I didn't see an exact match in the bug database, but http://bugs.opensolaris.org/view_bug.do?bug_id=6328538 looks possible. (The line number doesn't quite match, but the call chain does.)

Someone else reported this last month:
http://www.opensolaris.org/jive/thread.jspa?messageID=155834
but there wasn't a definite conclusion there about what the problem is.

This message posted from opensolaris.org
Hi Jeyaram,

Thanks for your reply. Can you explain more about this bug?

Regards
Masthan D

Prabahar Jeyaram <Prabahar.Jeyaram at Sun.COM> wrote:
> Your system seems to have hit a variant of this bug:
>
> 6458218 - http://bugs.opensolaris.org/view_bug.do?bug_id=6458218
>
> This is fixed in OpenSolaris build 60 or S10U4.
> [...]
Prabahar Jeyaram
2007-Oct-09 15:38 UTC
[zfs-discuss] ZFS file system is crashing my system
Hi Masthan,

There was a race in the block allocation code that could hand out a single disk block to two consumers. The system trips the assertion when both consumers later try to free that block.

--
Prabahar.

On Oct 9, 2007, at 4:20 AM, dudekula mastan wrote:
> Thanks for your reply. Can you explain more about this bug?
> [...]
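To make that concrete, here is a tiny illustrative sketch. It is not the actual OpenSolaris space map code, just a toy model of the failure mode: if the allocator races and gives the same range to two writers, the second free can no longer find the segment, which is the kind of "ss != NULL" assertion seen in the panic above.

    /*
     * Toy model only -- not the real space_map.c.  One segment is freed
     * twice because the allocator raced and handed it to two consumers.
     */
    #include <assert.h>

    #define NSEGS 8

    struct seg {
            unsigned long start;
            unsigned long size;     /* size == 0 means "not in the map" */
    };

    static struct seg map[NSEGS] = {
            { 0x2b200000UL, 0x20000UL },    /* the one allocated segment */
    };

    static struct seg *
    find_seg(unsigned long start)
    {
            for (int i = 0; i < NSEGS; i++)
                    if (map[i].size != 0 && map[i].start == start)
                            return (&map[i]);
            return (NULL);
    }

    static void
    free_segment(unsigned long start)
    {
            struct seg *ss = find_seg(start);

            assert(ss != NULL);     /* the second free dies here */
            ss->size = 0;           /* remove the segment from the map */
    }

    int
    main(void)
    {
            free_segment(0x2b200000UL);     /* first consumer: fine */
            free_segment(0x2b200000UL);     /* second consumer: assertion fires */
            return (0);
    }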
Hi Everybody,

Over the last week many mails have been exchanged on this topic. I have a similar issue, and I would appreciate it if anyone could help me with it.

I have an IO test tool which writes data, reads it back, and then compares the read data with the written data. If they match, there is no corruption; otherwise there is corruption.

File data may get corrupted for many reasons, and one possible source is the file system cache. If the file system cache has issues, it can return wrong data to user applications (wrong data meaning the data actually on disk and the data the read call returns to the application do not match).

When there is corruption, to rule out file system cache issues, my application bypasses the file system cache and re-reads the data from the same file, then compares the re-read data with the written data.

Tell me, is there a way to skip the ZFS file system cache, or is there a way to do direct IO on a ZFS file system?

Regards
Masthan D
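For reference, a minimal sketch of that write/read-back/verify pattern is below. The file name, block size and block count are made-up example values. On UFS the re-read phase could be preceded by directio(3C) to bypass the page cache; as discussed earlier in this thread, ZFS has no equivalent yet, so on ZFS the second pass may still be served from the ARC.

    /*
     * Write/read-back/verify sketch.  File name, block size and count are
     * example values.  directio(3C) is honoured by UFS; ZFS currently has
     * no way to bypass its cache, so the re-read may come from the ARC.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLKSZ   8192
    #define NBLKS   128

    int
    main(void)
    {
            char wbuf[BLKSZ], rbuf[BLKSZ];
            int fd, i;

            fd = open("/testpool/fs/verify.dat", O_RDWR | O_CREAT | O_DSYNC, 0644);
            if (fd < 0) {
                    perror("open");
                    return (1);
            }

            /* Phase 1: write a known pattern, one block at a time. */
            for (i = 0; i < NBLKS; i++) {
                    (void) memset(wbuf, 'A' + (i % 26), BLKSZ);
                    if (pwrite(fd, wbuf, BLKSZ, (off_t)i * BLKSZ) != BLKSZ) {
                            perror("pwrite");
                            return (1);
                    }
            }

    #ifdef DIRECTIO_ON
            /* Works on UFS; no effect on ZFS, which has no direct I/O yet. */
            (void) directio(fd, DIRECTIO_ON);
    #endif

            /* Phase 2: read everything back and compare against the pattern. */
            for (i = 0; i < NBLKS; i++) {
                    (void) memset(wbuf, 'A' + (i % 26), BLKSZ);
                    if (pread(fd, rbuf, BLKSZ, (off_t)i * BLKSZ) != BLKSZ) {
                            perror("pread");
                            return (1);
                    }
                    if (memcmp(wbuf, rbuf, BLKSZ) != 0) {
                            (void) fprintf(stderr, "corruption at block %d\n", i);
                            return (1);
                    }
            }

            (void) printf("verify OK\n");
            (void) close(fd);
            return (0);
    }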
Hi All,

Any update on this?

-Masthan D

dudekula mastan <d_mastan at yahoo.com> wrote:
> Tell me, is there a way to skip the ZFS file system cache, or is there
> a way to do direct IO on a ZFS file system?
> [...]
> Tell me, is there a way to skip the ZFS file system cache, or is there
> a way to do direct IO on a ZFS file system?

No, currently there is no way to disable the file system cache (the ARC) in ZFS. There is a pending RFE, though:

6429855 Need way to tell ZFS that caching is a lost cause

Cheers,
Vidya Sakar

dudekula mastan wrote:
> Hi All,
>
> Any update on this?
>
> -Masthan D
> [...]
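One partial workaround while that RFE is open: the ARC cannot be switched off, but it can be capped so it leaves room for the application's own buffer pool. A commonly cited tunable for recent Nevada and S10U4 builds (the 1 GB value below is only an example, and /etc/system changes need a reboot):

    * /etc/system -- cap the ZFS ARC at 1 GB (example value)
    set zfs:zfs_arc_max = 0x40000000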
Restarting this thread... I've just finished reading the article, "A look at MySQL on ZFS":
http://dev.mysql.com/tech-resources/articles/mysql-zfs.html

The section "MySQL Performance Comparison: ZFS vs. UFS on Open Solaris" looks interesting...

Rayson
Hi Tony,

John posted the URL to his article to the databases-discuss list this morning, and I have only taken a very quick look. Maybe you can join that list and discuss the configurations further?

http://mail.opensolaris.org/mailman/listinfo/databases-discuss

Rayson

On 10/29/07, Tony Leone <Tony.Leone at oag.state.ny.us> wrote:
> This is very interesting because it directly contradicts the results
> the ZFS developers are posting on the OpenSolaris mailing list. I just
> scanned the article; does he give his ZFS settings, and is he using
> separate ZIL devices?
>
> Tony Leone
>
> >>> "Rayson Ho" <rayrayson at gmail.com> 10/29/2007 11:39 AM >>>
> Restarting this thread... I've just finished reading the article, "A
> look at MySQL on ZFS":
> http://dev.mysql.com/tech-resources/articles/mysql-zfs.html
>
> The section "MySQL Performance Comparison: ZFS vs. UFS on Open
> Solaris" looks interesting...
>
> Rayson