Hi all; I just joined the group. My team has created a tool to watch
Solaris, Linux and Windows. It runs non-root on both the client and the
server. We are currently using it to watch the 750 Sun Ray servers inside
Sun. The basic thing it does is execute a 55-command shell script on
the client and send the 1000-line output up to the central server. We
do this every 10 minutes across the 750 machines. The central server is
a Niagara T2000 (32 x 1000 MHz) running Solaris 10 Generic_118833-08. We
put ZFS on this box two months ago. We currently have 3.5 million files
and 3 billion lines of ASCII sitting on the internal drive of the Niagara.
The box runs at less than 20% load. Everything had been working perfectly
until two days ago; now it can take 10 minutes to exit from vi. The
following truss shows that writing a 3-line file on the ZFS volume
(/archives) spent almost 15 minutes in fdsync. For those of you
inside Sun, you can see the web page of all Sun Ray servers at
http://canary.sfbay. My team has contacted engineering and we are
waiting for more help. Any suggestions as to what we might have hit would
be appreciated.

itsm-mpk-2% zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
canary                   54G   38.9G   15.1G    72%  ONLINE     -
%truss -o truss.out -dD vi /archives/junk
  2.8150   0.0005 write(1, "1B [ 1 M1B [ 3 2 B ~1B [".., 16)      = 16
read(0, 0xFFBFD00F, 1)          (sleeping...)
  4.1164   1.3014 read(0, " Z", 1)                                = 1
  4.3322   0.2158 read(0, " Z", 1)                                = 1
  4.3329   0.0007 write(1, "1B [ 3 3 B", 5)                       = 5
  4.3332   0.0003 write(1, " " / a r c h i v e s / j".., 16)      = 16
  4.3336   0.0004 stat64("/archives/junk", 0xFFBFCEF8)            = 0
148.4689 144.1353 creat("/archives/junk", 0666)                   = 5
148.4703   0.0014 ioctl(2, TCSETSW, 0x00060C10)                   = 0
148.4708   0.0005 write(5, "\n 1 2 3 1 2 3\n", 8)                 = 8
*971.7021 823.2313 fdsync(5, FSYNC)                               = 0*
971.8102   0.1081 close(5)                                        = 0
971.8108   0.0006 write(1, "   2   l i n e s ,   8  ".., 23)      = 23
971.8113   0.0005 write(1, "\r\n", 2)                             = 2
971.8116   0.0003 write(1, "1B [ J", 3)                           = 3
971.8120   0.0004 write(1, "1B [ ? 1 l1B >", 7)                   = 7
971.8124   0.0004 ioctl(2, TCSETSW, 0x00060C10)                   = 0
971.8126   0.0002 ioctl(2, TCGETS, 0x000E8098)                    = 0
971.8128   0.0002 ioctl(0, I_STR, 0x000579F8)                     Err#22 EINVAL
971.8130   0.0002 ioctl(2, TIOCGPGRP, 0xFFBFCF8C)                 = 0
971.8131   0.0001 getpgid(0)                                      = 9456
971.8134   0.0003 ioctl(2, TCSETSW, 0x000E8074)                   = 0
971.8137   0.0003 unlink("/var/tmp/ExKaaiEs")                     = 0
971.8141   0.0004 unlink("/var/tmp/RxLaaiEs")                     = 0
971.8144   0.0003 close(4)                                        = 0
971.8146   0.0002 _exit(0)
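
A long trace like the one above is easier to read if the slowest calls are pulled out mechanically. The following is a minimal Python sketch, not part of the canary tool: it assumes the two leading columns produced by `truss -d -D` (elapsed seconds, then seconds since the previous traced event), and the function name `slowest_calls` and the `SAMPLE` text are made up for illustration. The delta only approximates the time spent inside the call on that line.

```python
import re

# One traced call: optional "*" highlight, elapsed seconds, delta seconds, call name.
LINE = re.compile(r"^\*?\s*(\d+\.\d+)\s+(\d+\.\d+)\s+(\w+)\(")

def slowest_calls(truss_text, top=3):
    """Return up to `top` (delta_seconds, syscall, elapsed_seconds) tuples, slowest first."""
    rows = []
    for line in truss_text.splitlines():
        m = LINE.match(line)
        if m:
            elapsed, delta, name = float(m.group(1)), float(m.group(2)), m.group(3)
            rows.append((delta, name, elapsed))
    return sorted(rows, reverse=True)[:top]

# A few lines copied from the trace above, with wrapped lines rejoined.
SAMPLE = """\
  4.3336   0.0004 stat64("/archives/junk", 0xFFBFCEF8)            = 0
148.4689 144.1353 creat("/archives/junk", 0666)                   = 5
148.4708   0.0005 write(5, "\\n 1 2 3 1 2 3\\n", 8)                 = 8
971.7021 823.2313 fdsync(5, FSYNC)                                = 0
971.8102   0.1081 close(5)                                        = 0
"""

if __name__ == "__main__":
    for delta, name, elapsed in slowest_calls(SAMPLE):
        print(f"{name:8s} {delta:10.4f} s (returned at {elapsed:.4f} s)")
```

On the sample lines this ranks fdsync first and creat second, which matches what the eye picks out of the full trace: the creat on /archives also stalled for over two minutes.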
I tried the same command on /var/tmp, a non-ZFS volume. There, fdsync(4, FSYNC)
takes 0.3 seconds:
 3.4120  0.0004 write(1, " " / v a r / t m p / j u".., 18)      = 18
 3.4123  0.0003 stat64("/var/tmp/junk123", 0xFFBFCEF8)          Err#2 ENOENT
 3.4128  0.0005 creat("/var/tmp/junk123", 0666)                 = 4
 3.4132  0.0004 ioctl(2, TCSETSW, 0x00060C10)                   = 0
 3.4135  0.0003 write(4, " a\n\n", 3)                           = 3
* 3.7586  0.3451 fdsync(4, FSYNC)                                = 0*
 3.7589  0.0003 close(4)                                        = 0
 3.7592  0.0003 write(1, "   [ N e w   f i l e ]  ".., 34)      = 34
 3.7596  0.0004 write(1, "\r\n", 2)                             = 2
 3.7599  0.0003 write(1, "1B [ J", 3)                           = 3
 3.7602  0.0003 write(1, "1B [ ? 1 l1B >", 7)                   = 7
 3.7606  0.0004 ioctl(2, TCSETSW, 0x00060C10)                   = 0
 3.7608  0.0002 ioctl(2, TCGETS, 0x000E8098)                    = 0
 3.7610  0.0002 ioctl(0, I_STR, 0x000579F8)                     Err#22 EINVAL
 3.7612  0.0002 ioctl(2, TIOCGPGRP, 0xFFBFCF8C)                 = 0
 3.7614  0.0002 getpgid(0)                                      = 28194
 3.7616  0.0002 ioctl(2, TCSETSW, 0x000E8074)                   = 0
 3.7619  0.0003 unlink("/var/tmp/ExR9aqk3")                     = 0
 3.7623  0.0004 _exit(0)
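
The comparison above can be repeated without an editor in the loop. This is a minimal Python sketch under stated assumptions: `time_fsync` is a hypothetical helper, the directory passed to it is whatever you want to test (for example /archives and /var/tmp on the machine above), and `os.fsync()` is the call that truss on Solaris reports as fdsync(fd, FSYNC) in these traces.

```python
import os
import tempfile
import time

def time_fsync(directory, payload=b"a\n\n"):
    """Create a small file in `directory`, write `payload`, and return the seconds spent in os.fsync()."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        start = time.perf_counter()
        os.fsync(fd)                      # the step that stalled in the ZFS trace
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)                   # leave no test file behind

if __name__ == "__main__":
    # The directory list is an assumption; substitute the file systems to compare.
    for d in (tempfile.gettempdir(),):
        print(f"{d}: fsync took {time_fsync(d):.4f} s")
```

Running it in a loop against both file systems would show whether the delay is constant or tracks the 10-minute collection cycle.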
-- 
The canary monitor also watches our central server. Here is the CPU load
for the last 7 days on the T2000; the red line is %USR+%SYS from a
one-minute sar -u sample. We mostly use just 10% of the Niagara to receive
110,000 files (750 million lines) per day.
We also show all disk activity on the server by summing the
r+w/s column from "sar -d".
itsm-mpk-2% sar -d 3 3
SunOS itsm-mpk-2 5.10 Generic_118833-08 sun4v    06/17/2006
12:39:48   device        %busy   avque   r+w/s  blks/s  avwait  avserv
12:39:51   nfs1              0     0.0       0       0     0.0     0.0
           nfs60             0     0.0       0       0     0.0     0.0
           sd1             100    35.0     684    6131    27.8    23.4
           sd1,a             0     0.0       0       0     0.0     0.0
           sd1,b             0     0.0       0       0     0.0     0.0
           sd1,c             0     0.0       0       0     0.0     0.0
           sd1,d           100    35.0     684    6131    27.8    23.4
           sd2               0     0.0       0       0     0.0     0.0
           ohci0,bu          0     0.0       0       0     0.0     0.0
           ohci0,ct          0     0.0       0       0     0.0     0.0
           ohci0,in          0     0.0       0       0     0.0     0.0
           ohci0,is          0     0.0       0       0     0.0     0.0
           ohci0,to          0     0.0       0       0     0.0     0.0
           ohci1,bu          0     0.0       0       0     0.0     0.0
           ohci1,ct          0     0.0       0       0     0.0     0.0
           ohci1,in          0     0.0       0       0     0.0     0.0
           ohci1,is          0     0.0       0       0     0.0     0.0
           ohci1,to          0     0.0       0       0     0.0     0.0
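
As an illustration of the summing described above, here is a minimal Python sketch. It is not the canary tool's code: `total_rw_per_sec` is a hypothetical name, and treating every device name that contains a comma as a sub-device of another row is an assumption based on the sd1 / sd1,d pair in this output, where both rows report the same 684 r+w/s.

```python
def total_rw_per_sec(sar_text, include_slices=False):
    """Sum the r+w/s column of one interval of `sar -d` output.

    Rows such as "sd1,d" appear to repeat I/O already counted on the
    parent row ("sd1"), so they are skipped unless include_slices is True.
    """
    total = 0.0
    for line in sar_text.splitlines():
        fields = line.split()
        # Data rows end with six numeric columns; the first row of an
        # interval also carries a leading timestamp.
        if len(fields) < 7:
            continue
        device, numbers = fields[-7], fields[-6:]
        try:
            values = [float(n) for n in numbers]
        except ValueError:
            continue  # column-header line
        if "," in device and not include_slices:
            continue
        total += values[2]  # columns: %busy avque r+w/s blks/s avwait avserv
    return total

# A few rows copied from the output above.
SAMPLE = """\
12:39:48   device        %busy   avque   r+w/s  blks/s  avwait  avserv
12:39:51   nfs1              0     0.0       0       0     0.0     0.0
           sd1             100    35.0     684    6131    27.8    23.4
           sd1,d           100    35.0     684    6131    27.8    23.4
           sd2               0     0.0       0       0     0.0     0.0
"""

if __name__ == "__main__":
    print(total_rw_per_sec(SAMPLE))
    print(total_rw_per_sec(SAMPLE, include_slices=True))
```

Summing every row would count sd1's I/O twice here, so which rows to include is worth deciding before graphing. The function expects a single interval; feeding it a multi-interval run would add the intervals together.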
Here is a graph of the disk activity (again over the last 7 days).
Thanks for any help with the fdsync.
PS: It is planned to make the canary monitor tool open source.
<http://www.sun.com> 	* Sean Meighan *
Mgr ITSM Engineering
*Sun Microsystems, Inc.*
US
Phone x32329 / +1 408 850-9537
Mobile 303-520-2024
Fax 408 850-9537
Email Sean.Meighan at Sun.COM
	
[Attachment: itsm-mpk-2.SFBay.Sun.com_cpu.png (image/png, 13478 bytes)]
<http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060617/21ce5559/attachment.png>
[Attachment: itsm-mpk-2.SFBay.Sun.com_sard.png (image/png, 11450 bytes)]
<http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060617/21ce5559/attachment-0001.png>

Sean Meighan wrote:
> ... Everything has been working perfectly until two days ago, now it
> can take 10 minutes to exit from vi. The following truss shows that
> the 3 line file that is sitting on the ZFS volume (/archives) took
> almost 15 minutes in fdsync. For those of you inside of Sun you can
> see the web page of all Sun Ray servers at http://canary.sfbay. My
> team has contacted engineering and we are waiting for more help. Any
> suggestions of what we might have hit would be appreciated.

Have you tried using dtrace to follow the code inside the kernel?
Or just to profile the kernel while this script is running?

Darren

Sean Meighan wrote:
> [...]
> The box runs less than 20% load.
> Everything has been working perfectly until two days ago, now it can
> take 10 minutes to exit from vi.

Hi Sean,

May I ask what happened between 13/6 and 14/6? (Forgive my European
bias in dates.) Without getting into the zfs details, *what changes
occurred in the system* around that time, which is about two days
before the "two days ago", which I assume refers to 15/6.

We don't need any statistical process analysis here, but just from
looking at the graphs I would say that until the end of the day on
13/6, the system shows very regular activity in cpu *and* disk.
The assertion cpu > users is true for most of this period, with only a
few exceptions. After 14/6, this definitely isn't.

* At 18:00 13/6 the baseline for both cpu and disk drops visibly.
* Around 08:00 14/6 the baseline jumps back up to previous levels.
* Between 14:00 and 16:00, baseline disk activity reaches new heights
  for the week. Variance is quite low during this time.
* At the same time, CPU load rises steadily, with little variation,
  until halfway through, when it starts a new, more chaotic behaviour
  that deviates significantly from previous patterns.
* After 16:00 14/6, disk activity also enters a new, more chaotic
  pattern with a higher variance.
* There are a couple of polling gaps on 15/6 and 16/6, which was after
  you realised something was happening.

The line graph has some disadvantages. I'd like to see a scatterplot,
if possible with a log scale from 1 to 100 on the Y-axis...
I don't have access to staroffice and the stats-plugin anymore, and
don't know if you have, but for data analysis they could be helpful.

But first check what actually happened between 13 and 14 June.
Were new h/w resources added? New s/w installed or accessed?

Cheers,
Henk Langeveld

Sean Meighan wrote:
> The box runs less than 20% load. Everything has been working perfectly
> until two days ago, now it can take 10 minutes to exit from vi. The
> following truss shows that the 3 line file that is sitting on the ZFS
> volume (/archives) took almost 15 minutes in fdsync.

I have made similar observations. With virtually no other activity on the
box, exiting from vi (:wq) sometimes takes 10-60 seconds. Trussing the
process always shows vi waiting in fdsync(). This happened on at
least two different machines:

- Sun E3500 with ZFS on A5200 disks (snv_37; a large delay of 10-60 secs
  almost every time when exiting from vi)
- Solaris/amd64 with ZFS on SATA disks (snv_39; only a few delays,
  usually not longer than 10 secs)

Daniel

Daniel Rock wrote:
> [...]
> With virtual no other activity on the box exiting from vi (:wq)
> sometimes takes 10-60 seconds. Trussing the process always shows vi
> is waiting in fdsync().
> [...]

I think you may be tripping over:

CR 6413510 zfs: writing to ZFS filesystem slows down fsync() on other
files in the same FS
(see http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6413510)

It is apparently because ZFS currently syncs the entire filesystem when
asked to sync just one file.

Dana

ZFS engineering got back to us today and said the following:

    In addition to 6404018 there are a couple of other performance
    bottlenecks:

    6413510 zfs: writing to ZFS filesystem slows down fsync() on other
            files in the same FS
    6429205 each zpool needs to monitor it's throughput and throttle
            heavy writers

    joining together and causing the slowdown in this system
    itsm-mpk-2.sfbay.

    6404018/6413510 is caused by the way the ZFS logging records are
    handled. Disabling ZFS logging (add 'set zfs:zil_disable=1' to
    /etc/system) might give you relief from the 'vi' slowdown problem.

    6429205 is more likely to hit hard on machines with lots of memory
    and CPUs and weak storage, which is true in this case. One disk is
    not able to handle the IO load generated by the 32 virtual CPUs. As
    a workaround, can you try adding one or two disks to distribute the
    load?

    Yes. There is no fix/patch for these bugs yet. We would recommend
    trying the workarounds until a fix is available.

What is the downside of disabling logging records? This is a production
machine and we are now affecting a fairly large population.

thanks
sean

Sean,

I'm not sure yet that the bugs below are responsible for this, but I
don't know what is either. 15 minutes to do a fdsync is way outside the
slowdown usually seen. The footprint for 6413510 is that when a huge
amount of data is being written non-synchronously and an fsync comes in
for the same filesystem, then all the non-synchronous data is also forced
out synchronously. So is there a lot of data being written during the vi?

Also, you say "Everything has been working perfectly until two days ago,
now it can take 10 minutes to exit from vi". So what happened two days ago?

Finally, I would not recommend disabling logging. That switch was only
intended for internal use. Applications that rely on POSIX synchronous
semantics will not get what they asked for.

Neil.

Sean Meighan wrote On 06/19/06 17:48,:
> ZFS engineering got back to us today and said the following:
>
> In addition to 6404018 there are couple of other performance bottle necks :
>
> 6413510 zfs: writing to ZFS filesystem slows down fsync() on other files
> in the same FS
> 6429205 each zpool needs to monitor it's throughput and throttle heavy
> writers
> [...]
> what is the downside of disabling loggin records? this is a production
> machine and we are now affecting a fairly large population.

> 15 minutes to do a fdsync is way outside the slowdown usually seen.
> The footprint for 6413510 is that when a huge amount of
> data is being written non synchronously and a fsync comes in for the
> same filesystem then all the non-synchronous data is also forced out
> synchronously. So is there a lot of data being written during the vi?

vi will write the whole file (in 4K chunks) and fsync it
(based on a single experiment).

So for a large file, vi on quit has lots of data to
sync in and of itself. But because of 6413510 we potentially
have to sync lots of other data written by other
applications.

Now take a Niagara with lots of available CPUs and lots
of free memory (32GB maybe?) running some 'tar x' in
parallel. A huge chunk of the 32GB can end up dirty.

I say too much so, because of the lack of throttling:

    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205
    6429205 each zpool needs to monitor it's throughput and throttle heavy writers

Then vi :q fsyncs, and all of the pending data must
sync. So we have extra data to sync because of:

    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6413510
    zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS

Furthermore, we can be slowed by this:

    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6440499
    zil should avoid txg_wait_synced() and use dmu_sync() to issue parallel IOs...

Note: 6440499 is now fixed in the gate.

And finally, all this data goes to a single disk. Worse, a
slice of a disk. Since it's just a slice, ZFS can't enable
the write cache. Then if there is no tag queue (is there?) we
will handle everything one I/O at a time. If it's a SATA
drive we have other issues...

I think we've hit it all here. So can this lead to a 15 min
fsync? I can't swear. Actually I won't be convinced myself
before I convince you, but we do have things to chew on
already.

Do I recall that this is about a 1GB file in vi?
:wq-uitting out of a 1 GB vi session on a 50MB/sec disk will
take 20sec when everything hums and there is no other
traffic involved. With no write cache / no tag queue, maybe
10X more.

-r
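
Roch's estimate is simple division, and the same arithmetic can be run backwards against the trace. The following Python sketch is only a back-of-envelope aid: the function names are made up, decimal units are assumed (1 GB = 1000 MB), and the 10x penalty is Roch's own rough guess rather than a measured figure.

```python
MB = 1_000_000
GB = 1_000 * MB

def flush_seconds(bytes_to_sync, disk_bytes_per_sec, penalty=1.0):
    """Seconds to push `bytes_to_sync` to a disk, scaled by a slowdown factor."""
    return bytes_to_sync / disk_bytes_per_sec * penalty

def implied_bytes(seconds, disk_bytes_per_sec, penalty=1.0):
    """Inverse: how much data a stall of `seconds` would correspond to."""
    return seconds * disk_bytes_per_sec / penalty

if __name__ == "__main__":
    # 1 GB file, 50 MB/s disk, everything humming.
    print(flush_seconds(1 * GB, 50 * MB))              # 20.0
    # Same file with no write cache / no tag queue, "maybe 10X more".
    print(flush_seconds(1 * GB, 50 * MB, penalty=10))  # 200.0
    # The 823 s fdsync from the original trace, read backwards.
    print(implied_bytes(823.2313, 50 * MB, penalty=10) / GB, "GB")
```

Read backwards, the 823-second fdsync corresponds to several gigabytes of pending data even at the degraded rate, so this model only works if other writers' data is being dragged along. Dirty data from the file itself cannot account for it.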

The vi we were doing was on a 2-line file. If you just vi a new file, add
one line and exit, it would take 15 minutes in fdsync. On recommendation
of a workaround we set

    set zfs:zil_disable=1

After the reboot the fdsync now takes < 0.1 seconds. I have no idea
whether it was this setting or the fact that we went through a reboot.
Whatever the root cause, we are now back to a well-behaved file system.

thanks
sean

Roch wrote On 06/21/06 07:31,:
> vi will write the whole file (in 4K) chunks and fsync it.
> (based on a single experiment).

Sean kindly gave me access to the system and so far I have reproduced the
problem. It just requires an fsync on a file with 1 byte, and for me it
takes 10 minutes to fsync! I have run a few dscripts but have yet to make
much more progress. I do see that the zfs version is fairly old, so we may
be chasing an old bug, or perhaps this really is an extreme version of
6413510, as there is often 3MB of data being collected and written to the
pool and probably the same fs.

Neil

Well, this does look more and more like a duplicate of:

6413510 zfs: writing to ZFS filesystem slows down fsync() on other files
in the same FS

Neil

Sean Meighan writes:
 > The vi we were doing was a 2 line file. If you just vi a new file, add
 > one line and exit it would take 15 minutes in fdsynch. On recommendation
 > of a workaround we set
 >
 > set zfs:zil_disable=1
 >
 > after the reboot the fdsynch is now < 0.1 seconds. Now I have no idea
 > if it was this setting or the fact that we went through a reboot.
 > Whatever the root cause we are now back to a well behaved file system.

Well behaved... in appearance only!

Maybe it's nice to validate the hypothesis, but you should not
run with this option set, ever. It disables O_DSYNC and
fsync() and I don't know what else.

Bad idea, bad.

-r
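
For readers who have not used the two interfaces Roch names, this is roughly what an application does when it asks for synchronous semantics. It is a minimal Python sketch with hypothetical function names. The fallback to `os.O_SYNC` where `os.O_DSYNC` is missing is an assumption made for portability, not something from this thread.

```python
import os
import tempfile

def write_then_fsync(path, data):
    """Buffered write followed by an explicit fsync(); returns once the data is reported stable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

def write_with_o_dsync(path, data):
    """Open with O_DSYNC so that each write() returns only after the data is reported stable."""
    # os.O_DSYNC is not defined on every platform; fall back to O_SYNC where it is missing.
    flag = getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | flag, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)

if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    write_then_fsync(os.path.join(workdir, "a"), b"1 2 3\n")
    write_with_o_dsync(os.path.join(workdir, "b"), b"1 2 3\n")
```

With the switch set, both calls would still return success, as Roch describes. The application simply could no longer rely on the data having reached stable storage at that point.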
Torrey McMahon
2006-Jun-21  16:29 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Roch wrote:
> Sean Meighan writes:
> > [...] On recommendation of a workaround we set
> >
> > set zfs:zil_disable=1
> >
> > [...] Whatever the root cause we are now back to a well behaved
> > file system.
>
> well behaved...In appearance only !
>
> Maybe it's nice to validate hypothesis but you should not
> run with this option set, ever., it disable O_DSYNC and
> fsync() and I don't know what else.
>
> Bad idea, bad.

Why is this option available then? (Yes, that's a loaded question.)

Torrey McMahon wrote On 06/21/06 10:29,:
> Roch wrote:
>> Maybe it's nice to validate hypothesis but you should not
>> run with this option set, ever., it disable O_DSYNC and
>> fsync() and I don't know what else.
>>
>> Bad idea, bad.
>
> Why is this option available then? (Yes, that's a loaded question.)

I wouldn't call it an option, but an internal debugging switch that I
originally added to allow progress when initially integrating the ZIL.
As Roch says it really shouldn't be ever set (as it does negate POSIX
synchronous semantics). Nor should it be mentioned to a customer.
In fact I'm inclined to now remove it - however it does still have a use
as it helped root cause this problem.

Neil
Robert Milkowski
2006-Jun-21  17:09 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Hello Neil,

Wednesday, June 21, 2006, 6:41:50 PM, you wrote:

>> Why is this option available then? (Yes, that's a loaded question.)

NP> I wouldn't call it an option, but an internal debugging switch that I
NP> originally added to allow progress when initially integrating the ZIL.
NP> As Roch says it really shouldn't be ever set (as it does negate POSIX
NP> synchronous semantics). Nor should it be mentioned to a customer.
NP> In fact I'm inclined to now remove it - however it does still have a use
NP> as it helped root cause this problem.

Isn't it similar to unsupported fastfs for ufs?
I think it could be useful in some cases after all.

--
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Nicolas Williams
2006-Jun-21  17:10 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Wed, Jun 21, 2006 at 10:41:50AM -0600, Neil Perrin wrote:
> >Why is this option available then? (Yes, that's a loaded question.)
>
> I wouldn't call it an option, but an internal debugging switch that I
> originally added to allow progress when initially integrating the ZIL.
> As Roch says it really shouldn't be ever set (as it does negate POSIX
> synchronous semantics). Nor should it be mentioned to a customer.
> In fact I'm inclined to now remove it - however it does still have a use
> as it helped root cause this problem.

Rename it to "zil_disable_danger_will_robinson" :)
Torrey McMahon
2006-Jun-21  17:19 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Nicolas Williams wrote:
> On Wed, Jun 21, 2006 at 10:41:50AM -0600, Neil Perrin wrote:
>> I wouldn't call it an option, but an internal debugging switch that I
>> originally added to allow progress when initially integrating the ZIL.
>> [...]
>> In fact I'm inclined to now remove it - however it does still have a use
>> as it helped root cause this problem.
>
> Rename it to "zil_disable_danger_will_robinson"

The sad truth is that debugging bits tend to survive into production, and
then we get escalations that go something like, "I set this variable in
/etc/system and now I'm {getting data corruption, weird behavior, an odd
rash, ...}"

The fewer the better, imho. If it can be removed, great. If not, then
maybe something for the tunables guide.
Robert Milkowski wrote On 06/21/06 11:09,:
> Hello Neil,
>
>>>Why is this option available then? (Yes, that's a loaded question.)
>
> NP> I wouldn't call it an option, but an internal debugging switch that I
> NP> originally added to allow progress when initially integrating the ZIL.
> NP> As Roch says it really shouldn't be ever set (as it does negate POSIX
> NP> synchronous semantics). Nor should it be mentioned to a customer.
> NP> In fact I'm inclined to now remove it - however it does still have a use
> NP> as it helped root cause this problem.
>
> Isn't it similar to unsupported fastfs for ufs?

It is similar in the sense that it speeds up the file system.
Using fastfs can be much more dangerous though as it can lead
to a badly corrupted file system as writing meta data is delayed
and written out of order. Whereas disabling the ZIL does not affect
the integrity of the fs. The transaction group model of ZFS gives
consistency in the event of a crash/power fail. However, any data that
was promised to be on stable storage may not be unless the transaction
group committed (an operation that is started every 5s).

We once had plans to add a mount option to allow the admin
to control the ZIL. Here's a brief section of the RFE (6280630):

    sync={deferred,standard,forced}

        Controls synchronous semantics for the dataset.

        When set to 'standard' (the default), synchronous operations
        such as fsync(3C) behave precisely as defined in
        fcntl.h(3HEAD).

        When set to 'deferred', requests for synchronous semantics
        are ignored. However, ZFS still guarantees that ordering
        is preserved -- that is, consecutive operations reach stable
        storage in order. (If a thread performs operation A followed
        by operation B, then the moment that B reaches stable storage,
        A is guaranteed to be on stable storage as well.) ZFS also
        guarantees that all operations will be scheduled for write to
        stable storage within a few seconds, so that an unexpected
        power loss only takes the last few seconds of change with it.

        When set to 'forced', all operations become synchronous.
        No operation will return until all previous operations
        have been committed to stable storage. This option can be
        useful if an application is found to depend on synchronous
        semantics without actually requesting them; otherwise, it
        will just make everything slow, and is not recommended.

Of course we would need to stress the dangers of setting 'deferred'.
What do you guys think?

Neil.
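The three proposed modes are easier to compare with a toy model. The sketch below is not ZFS code: the class, its methods, and the single pending queue are all assumptions made for illustration, and it simplifies heavily (a real fsync covers one file, not the whole dataset). It only tries to capture the guarantees stated in the RFE text: 'deferred' ignores sync requests but keeps ordering, 'standard' honours them, and 'forced' makes every operation synchronous.

```python
class ToyDataset:
    """Hypothetical model of sync={deferred,standard,forced}; not ZFS code."""

    MODES = ("deferred", "standard", "forced")

    def __init__(self, sync="standard"):
        if sync not in self.MODES:
            raise ValueError(f"unknown sync mode: {sync}")
        self.sync = sync
        self.pending = []   # accepted, but not yet on stable storage
        self.stable = []    # would survive a crash

    def _commit(self):
        # Pending operations reach stable storage in the order issued,
        # so the stable list is always a prefix of the issued operations.
        self.stable.extend(self.pending)
        self.pending.clear()

    def write(self, op):
        self.pending.append(op)
        if self.sync == "forced":
            self._commit()      # every operation becomes synchronous

    def fsync(self):
        if self.sync != "deferred":
            self._commit()      # 'deferred' ignores the request

    def txg_commit(self):
        # Stands in for the periodic transaction group commit.
        self._commit()

    def crash(self):
        # Power loss: whatever is still pending is gone.
        self.pending.clear()
        return list(self.stable)


def run(mode):
    """Write A, fsync, write B, then crash; return what survived."""
    ds = ToyDataset(mode)
    ds.write("A")
    ds.fsync()
    ds.write("B")
    return ds.crash()
```

In this model `run("standard")` keeps only A, `run("deferred")` keeps nothing because no transaction group committed before the crash, and `run("forced")` keeps both. In every case the survivors are a prefix of what was written, which is the ordering guarantee the RFE describes.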
Bill Sommerfeld
2006-Jun-21  19:03 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?

I can think of a use case for "deferred": improving the efficiency of a
large mega-"transaction"/batch job such as a nightly build.

You create an initially empty or cloned dedicated filesystem for the
build, and start it off, and won't look inside until it completes. If
the build machine crashes in the middle of the build you're going to
nuke it all and start over because that's lower risk than assuming you
can pick up where it left off.

now, it happens that a bunch of tools used during a build invoke fsync.
But in the context of a full nightly build that effort is wasted. All
you need is one big "sync everything" at the very end, either by using a
command like sync or lockfs -f, or as a side effect of reverting from
sync=deferred to sync=standard.

					- Bill
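The trade-off in the build use case can be tried from user space. The following is only an application-level analogue, assuming a Unix system and Python 3: it compares calling fsync on every file with writing everything and issuing one system-wide sync at the end. The function names and the file naming scheme are made up for the sketch. The proposed 'deferred' setting would get a similar effect without modifying the build tools, by having the filesystem ignore their fsync calls.

```python
import os
import tempfile
import time


def write_files(directory, count, payload=b"x" * 8192, fsync_each=True):
    """Write `count` small files, optionally requesting durability per file."""
    for i in range(count):
        path = os.path.join(directory, f"obj{i:05d}.o")
        with open(path, "wb") as f:
            f.write(payload)
            if fsync_each:
                f.flush()
                os.fsync(f.fileno())   # what many build tools do today
    return count


def build_then_sync(directory, count):
    """Skip the per-file fsync and ask for one flush at the very end."""
    written = write_files(directory, count, fsync_each=False)
    os.sync()                          # one big "sync everything"
    return written


def demo(count=20):
    """Run both variants in throwaway directories and time them."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        t0 = time.perf_counter()
        write_files(a, count, fsync_each=True)
        t1 = time.perf_counter()
        build_then_sync(b, count)
        t2 = time.perf_counter()
        return {
            "fsync_each_seconds": t1 - t0,
            "single_sync_seconds": t2 - t1,
            "files": len(os.listdir(a)) + len(os.listdir(b)),
        }
```

Calling `demo()` returns the two elapsed times. How large the gap is depends entirely on the storage underneath, so no particular numbers are claimed here.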
Torrey McMahon
2006-Jun-21  19:44 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Neil Perrin wrote:
>
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?

That's the key: Be very explicit about what the option does and the side
effects.
Neil Perrin wrote:
> We once had plans to add a mount option to allow the admin
> to control the ZIL. Here's a brief section of the RFE (6280630):
>
>     sync={deferred,standard,forced}
>
> [...]
>
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?
>
> Neil.

Scares me, and it seems we should wait until people are demanding it and
we *have* to do it (if that time ever comes) - that is, we can't squeeze
any more performance gain out of the 'standard' method.

If problems do occur because of 'deferred' mode, then once I wrap up zpool
history we'll have the fact that they set this logged to disk.

eric
Neil,

I think it might be wise to look at this problem from the perspective
of an application (e.g. a simple database) designer taking into account
all the new things that Solaris ZFS provides.

In case of ZFS the designer does not have to worry about consistency
of the on-disk file system format but only about "has my data been
committed either to disk (or to NVRAM if there is one)". Depending on
the problem the designer tries to address it might be either the
total write throughput, in which case the designer might love the
"deferred" option, or the ability to sync file data to stable storage
and the latency of this operation. Considering flexibility of the
file system creation in ZFS I could imagine use of multiple file
systems with different mount options for different types of files.

All in all, though, the question is if a set of the POSIX calls with
the semantics defined through the mount options gives programmers
(or application designers) enough flexibility to address most common
issues in high level application scenarios in a simple and productive way.
If so, which of these different sync options are useful or needed?

-- Olaf

> It is similar in the sense that it speeds up the file system.
> Using fastfs can be much more dangerous though as it can lead
> to a badly corrupted file system as writing meta data is delayed
> and written out of order. Whereas disabling the ZIL does not affect
> the integrity of the fs. The transaction group model of ZFS gives
> consistency in the event of a crash/power fail. However, any data that
> was promised to be on stable storage may not be unless the transaction
> group committed (an operation that is started every 5s).
>
> We once had plans to add a mount option to allow the admin
> to control the ZIL. Here's a brief section of the RFE (6280630):
>
>     sync={deferred,standard,forced}
>
> [...]
>
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?
Jason Ozolins
2006-Jun-22  03:57 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld wrote:
> On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
>
>>Of course we would need to stress the dangers of setting 'deferred'.
>>What do you guys think?
>
> I can think of a use case for "deferred": improving the efficiency of a
> large mega-"transaction"/batch job such as a nightly build.
>
> You create an initially empty or cloned dedicated filesystem for the
> build, and start it off, and won't look inside until it completes. If
> the build machine crashes in the middle of the build you're going to
> nuke it all and start over because that's lower risk than assuming you
> can pick up where it left off.
>
> now, it happens that a bunch of tools used during a build invoke fsync.
> But in the context of a full nightly build that effort is wasted. All
> you need is one big "sync everything" at the very end, either by using a
> command like sync or lockfs -f, or as a side effect of reverting from
> sync=deferred to sync=standard.

Can I give support for this use case? Or does it take someone like
Casper Dik with 'fastfs' to come along later and provide a utility that
lets people make the filesystem do what they want it to? [still annoyed
that it took me so long to find out about fastfs - hell, the Solaris 8 or 9
OS installation process was using the same IOCTL as fastfs uses, but for
some reason end users still have to find fastfs out on the Net somewhere
instead of getting it with the OS].

If the ZFS docs state why it's not for general use, then what's to
separate this from the zillion other ways that a cavalier sysadmin can
bork their data (or indeed their whole machine)? Otherwise, why even
let people create a striped zpool vdev without redundancy - it's just an
accident waiting to happen, right? We must save people from themselves!
Think of the children! ;-)

-Jason =:^/

-- 
Jason.Ozolins at anu.edu.au      ANU Supercomputer Facility
APAC Grid Program                Leonard Huxley Bldg 56, Mills Road
Ph:  +61 2 6125 5449             Australian National University
Fax: +61 2 6125 8199             Canberra, ACT, 0200, Australia
How about the 'deferred' option be on a leased basis with a
deadline to revert to normal behavior; at most 24hrs at a
time. Console output every time the option is enabled.

-r

Torrey McMahon writes:
 > Neil Perrin wrote:
 > >
 > > Of course we would need to stress the dangers of setting 'deferred'.
 > > What do you guys think?
 >
 > That's the key: Be very explicit about what the option does and the side
 > effects.
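The lease idea can be written down in a few lines. This is purely a sketch of the suggestion as worded above, with hypothetical names throughout; nothing like it is known to exist in ZFS. The one design choice made here is to silently cap an over-long request at 24 hours rather than reject it, which is an assumption, not part of the proposal.

```python
from dataclasses import dataclass

MAX_LEASE_SECONDS = 24 * 60 * 60   # "at most 24hrs at a time"


@dataclass
class DeferredLease:
    """Hypothetical time-limited grant of sync=deferred on one dataset."""

    granted_at: float   # seconds since an arbitrary epoch
    duration: float     # requested lease length in seconds

    def __post_init__(self):
        if self.duration <= 0:
            raise ValueError("lease duration must be positive")
        # Assumption: cap the request instead of refusing it.
        self.duration = min(self.duration, MAX_LEASE_SECONDS)

    @property
    def deadline(self):
        return self.granted_at + self.duration

    def effective_mode(self, now):
        """'deferred' while the lease is live, 'standard' after the deadline."""
        return "deferred" if now < self.deadline else "standard"
```

For example, a lease requested for 48 hours at time 0 would report 'deferred' until second 86400 and 'standard' from then on.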
Darren J Moffat
2006-Jun-22  08:56 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld wrote:
> On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
>> Of course we would need to stress the dangers of setting 'deferred'.
>> What do you guys think?
>
> I can think of a use case for "deferred": improving the efficiency of a
> large mega-"transaction"/batch job such as a nightly build.

Yum Yum!!

We could even build this into nightly(1) once we have user delegation to
create clones.

nightly(1) would zfs clone, zfs set reservation=, zfs set sync=deferred,
and when it is done release the reservation, unset deferred, and snapshot.

When can we have it?

-- 
Darren J Moffat
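The nightly(1) sequence described above can be spelled out as a list of commands. The sketch below only builds the argument lists and never runs them. `sync=deferred` is the value proposed in this thread, not a setting confirmed to exist, and the dataset names, the reservation size, and the `@done` snapshot name are placeholders chosen for the example.

```python
def nightly_commands(origin_snapshot, build_fs, reservation):
    """Return (setup, teardown) argv lists for the proposed workflow.

    Nothing is executed; the caller decides whether and how to run them.
    """
    setup = [
        ["zfs", "clone", origin_snapshot, build_fs],
        ["zfs", "set", f"reservation={reservation}", build_fs],
        # Proposed in this thread; treat as hypothetical.
        ["zfs", "set", "sync=deferred", build_fs],
    ]
    teardown = [
        ["zfs", "set", "reservation=none", build_fs],
        ["zfs", "set", "sync=standard", build_fs],
        ["zfs", "snapshot", f"{build_fs}@done"],
    ]
    return setup, teardown


if __name__ == "__main__":
    before, after = nightly_commands("tank/ws@base", "tank/nightly", "10G")
    for argv in before + [["# ... run the build ..."]] + after:
        print(" ".join(argv))
```

Restoring `sync=standard` before taking the snapshot follows the order given in the message: release the reservation, unset deferred, then snapshot.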
Dana H. Myers
2006-Jun-22  09:20 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Darren J Moffat wrote:
> Bill Sommerfeld wrote:
>> On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
>>> Of course we would need to stress the dangers of setting 'deferred'.
>>> What do you guys think?
>>
>> I can think of a use case for "deferred": improving the efficiency of a
>> large mega-"transaction"/batch job such as a nightly build.
>
> Yum Yum!!
>
> We could even build this into nightly(1) once we have user delegation to
> create clones.
>
> nightly(1) would zfs clone, zfs set reservation=, zfs set sync=deferred,
> and when it is done release the reservation unset deffered and snapshot.
>
> When we can have it ?

Before we get too far down that path, has anyone timed a nightly
with and without zil_disable set?

Dana
Robert Milkowski
2006-Jun-22  12:36 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Hello Roch,
Thursday, June 22, 2006, 9:55:41 AM, you wrote:
R> How about the 'deferred' option be on a leased basis with a
R> deadline to revert to normal behavior; at most 24hrs at a
R> time. Console output every time the option is enabled.

I really hate when tools try to be more clever than sys-admins.
Generating some kind of warning - sure, why not.
-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
Bill Sommerfeld
2006-Jun-22  14:05 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, 2006-06-22 at 03:55, Roch wrote:
> How about the 'deferred' option be on a leased basis with a
> deadline to revert to normal behavior; at most 24hrs at a
> time.

why?

> Console output everytime the option is enabled.

in general, no. error messages to the console should be reserved for
truly frightening events and this simply isn't one of them.

					- Bill
Well, I should weigh in here. I have been using ZFS with an iscsi
backend and a NFS front end to my clients. Until B41 (not sure what
fixed this) I was getting 20KB/sec for RAIDZ and 200KB/sec for just ZFS
on large iscsi LUNs (non-RAIDZ) when I was receiving many small writes,
such as untarring of a linux or opensolaris tree, or artificially a copy
of 6250 8k files. It turned out the NFS would issue 3 fsyncs on each
write, and my performance degraded terribly from my normal 20MB+/sec
writes to the backend iscsi storage.

Now, a parallel test using NetApps shows no performance drop, but that's
because of NVRAM backed storage there, and a test against the same iscsi
targets using linux and XFS and the NFS server implementation there gave
me 1.25MB/sec writes. I was about to throw in the towel and deem ZFS/NFS
as unusable until B41 came along and at least gave me 1.25MB/sec.

This option, with all its caveats, would be ideal on various
NFS-provided filesystems (large cache directories for cluster nodes, tmp
space from my pools, etc) to get performance characteristics similar to
a NetApp Filer. If I can provide for that stable storage to a high
degree, or have only NVRAM-based storage, this could be a big win for
ZFS, if for nothing else than the RFPs that would require
benchmarks/bakeoffs against a NetApp showing it can perform just as
fast, caveats and all.

On 6/21/06, Olaf Manczak <Olaf.Manczak at sun.com> wrote:
> Neil,
>
> I think it might be wise to look at this problem from the perspective
> of an application (e.g. a simple database) designer taking into account
> all the new things that Solaris ZFS provides.
>
> [...]
>
> All in all, though, the question is if a set of the POSIX calls with
> the semantics defined through the mount options gives programmers
> (or application designers) enough flexibility to address most common
> issues in high level application scenarios a simple and productive way.
> If so which of these different sync options are useful or needed.
>
> -- Olaf
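To put the throughput figures from the NFS report side by side, the short calculation below works out how long the 6250-file copy would take at each quoted rate. It assumes each "8k" file is exactly 8 KB and that 1 MB is 1024 KB; the labels are paraphrases of the message, and the 20 MB/sec figure is the poster's stated lower bound for normal writes.

```python
FILES = 6250
FILE_KB = 8
TOTAL_KB = FILES * FILE_KB   # total data in the copy test, in KB

# Rates quoted in the message, converted to KB/sec (1 MB = 1024 KB assumed).
RATES_KB_PER_S = {
    "RAIDZ on iscsi, before B41": 20,
    "non-RAIDZ on iscsi, before B41": 200,
    "after B41 (also linux/XFS NFS server)": 1.25 * 1024,
    "normal writes to the iscsi backend": 20 * 1024,
}


def copy_seconds(rate_kb_per_s, total_kb=TOTAL_KB):
    """Time to move `total_kb` at a sustained rate, ignoring all overheads."""
    return total_kb / rate_kb_per_s


if __name__ == "__main__":
    for label, rate in RATES_KB_PER_S.items():
        print(f"{label}: {copy_seconds(rate):.1f} s")
```

Under these assumptions the same copy takes about 2500 seconds (roughly 42 minutes) at 20 KB/sec and under 3 seconds at 20 MB/sec, which shows how much the per-write fsyncs cost on that setup.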
Bill Sommerfeld writes:
 > On Thu, 2006-06-22 at 03:55, Roch wrote:
 > > How about the 'deferred' option be on a leased basis with a
 > > deadline to revert to normal behavior; at most 24hrs at a
 > > time.
 > why?

I'll trust your judgement over mine on this, so I won't press.
But it was mentioned that this would be useful to implement
time-bounded huge meta-transaction such as a build. Given that
we eventually do want to have a point where we know that data
is on stable-storage, I'd figure we could say upfront what the
time scale is.

Is there a sync command that targets individual FS ?

-r
Bill Sommerfeld
2006-Jun-22  17:15 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, 2006-06-22 at 13:01, Roch wrote:
> Is there a sync command that targets individual FS ?

Yes. lockfs -f

					- Bill
Darren J Moffat
2006-Jun-22  17:19 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld wrote:
> On Thu, 2006-06-22 at 13:01, Roch wrote:
>> Is there a sync command that targets individual FS ?
>
> Yes. lockfs -f

Does lockfs work with ZFS ? The man page appears to indicate it is very
UFS specific.

-- 
Darren J Moffat
Bill Sommerfeld
2006-Jun-22  17:32 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, 2006-06-22 at 13:19, Darren J Moffat wrote:
> > Yes. lockfs -f
>
> Does lockfs work with ZFS ? The man page appears to indicate it is very
> UFS specific.

all of lockfs does not. but, if truss is to be believed, the ioctl used by
lockfs -f appears to. or at least, it returns without error.

					- Bill
Jonathan Adams
2006-Jun-22  17:39 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, Jun 22, 2006 at 06:19:20PM +0100, Darren J Moffat wrote:
> Bill Sommerfeld wrote:
> >On Thu, 2006-06-22 at 13:01, Roch wrote:
> >> Is there a sync command that targets individual FS ?
> >
> >Yes. lockfs -f
>
> Does lockfs work with ZFS ? The man page appears to indicate it is very
> UFS specific.

Well, it just ends up doing an ioctl(), which zfs recognizes:

# dtrace -n 'syscall::ioctl:entry/pid == $target/{self->on = 1}' \
    -n'fbt:::/self->on/{}' -n 'syscall::ioctl:return/self->on/{self->on = 0}' \
    -F -c 'lockfs -f /aux1'
dtrace: description 'syscall::ioctl:entry' matched 1 probe
dtrace: description 'fbt:::' matched 44321 probes
dtrace: description 'syscall::ioctl:return' matched 1 probe
dtrace: pid 151072 has exited
CPU FUNCTION
  0  -> ioctl
  0    -> getf
  0      -> set_active_fd
  0      <- set_active_fd
  0    <- getf
  0    -> get_udatamodel
  0    <- get_udatamodel
  0    -> fop_ioctl
  0      -> zfs_ioctl
  0      <- zfs_ioctl
  0      -> zfs_sync
  0        -> zil_commit
  0        <- zil_commit
  0      <- zfs_sync
  0    <- fop_ioctl
  0    -> releasef
  0      -> clear_active_fd
  0      <- clear_active_fd
  0      -> cv_broadcast
  0      <- cv_broadcast
  0    <- releasef
  0  <- ioctl

So the sync happens.

Cheers,
- jonathan

-- 
Jonathan Adams, Solaris Kernel Development
As I recall, the zfs sync is, unlike UFS, synchronous. -r
Prabahar Jeyaram
2006-Jun-22  17:55 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Yep. ZFS supports the ioctl (_FIOFFS) which 'lockfs -f' issues.

-- Prabahar.

Darren J Moffat wrote:
> Bill Sommerfeld wrote:
>> On Thu, 2006-06-22 at 13:01, Roch wrote:
>>> Is there a sync command that targets individual FS ?
>>
>> Yes. lockfs -f
>
> Does lockfs work with ZFS ? The man page appears to indicate it is very
> UFS specific.
Jonathan Adams
2006-Jun-22  18:01 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, Jun 22, 2006 at 07:46:57PM +0200, Roch wrote:
>
> As I recall, the zfs sync is, unlike UFS, synchronous.

Uh, are you talking about sync(2), or lockfs -f? IIRC, lockfs -f is always
synchronous.

Cheers,
- jonathan

-- 
Jonathan Adams, Solaris Kernel Development
Yes, lockfs works. It uses the ZIL - unless the ZIL is disabled, in which
case it waits for all outstanding txgs to commit. The man page doesn't say
it's specific to UFS, but does mention one specific UFS detail.

Darren J Moffat wrote On 06/22/06 11:19,:
> Bill Sommerfeld wrote:
>
>> On Thu, 2006-06-22 at 13:01, Roch wrote:
>>
>>> Is there a sync command that targets individual FS ?
>>
>> Yes. lockfs -f
>
> Does lockfs work with ZFS ? The man page appears to indicate it is very
> UFS specific.

-- 
Neil
Robert Milkowski
2006-Jun-28  21:52 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Hello Neil,

Wednesday, June 21, 2006, 8:15:54 PM, you wrote:

NP> We once had plans to add a mount option to allow the admin
NP> to control the ZIL. Here's a brief section of the RFE (6280630):

NP>     sync={deferred,standard,forced}

NP> [...]

NP> Of course we would need to stress the dangers of setting 'deferred'.
NP> What do you guys think?

I think it would be really useful.
I have many times found myself in a situation where such features (like
fastfs) were my last resort.

The same with txg_time - in some cases tuning it could probably be
useful. Instead of playing with mdb it would be much better to put it into
zpool/zfs or another util (and if possible make it per fs, not per host).

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
Robert Milkowski wrote On 06/28/06 15:52,:
> Hello Neil,
>
> Wednesday, June 21, 2006, 8:15:54 PM, you wrote:
>
> [...]
>
> NP> Of course we would need to stress the dangers of setting 'deferred'.
> NP> What do you guys think?
>
> I think it would be really useful.
> I found myself many times in situation that such features (like
> fastfs) were my last resort help.

The overwhelming consensus was that it would be useful.
So I'll go ahead and put that on my to do list.

> The same with txg_time - in some cases tuning it could probably be
> useful. Instead of playing with mdb it would be much better put into
> zpool/zfs or other util (and if possible made per fs not per host).

This one I'm less sure about. I have certainly tuned txg_time myself to
force certain situations, but I wouldn't be happy exposing the inner
workings of ZFS - which may well change.

Neil