Hi all; I just joined the group. My team has created a tool to watch Solaris, Linux and Windows. It is non-root on both the client and the server. We currently are using it to watch the 750 Sun Ray servers inside of Sun. The basic thing it does is execute a 55-command shell script on the client and send the 1000-line output up to the central server. We do this every 10 minutes across the 750 machines. We have a Niagara T2000 (32 x 1000 MHz) server running Solaris 10 Generic_118833-08. We put ZFS on this box two months ago. We currently have 3.5 million files and 3 billion lines of ASCII sitting on the internal drive of the Niagara. The box runs less than 20% load.

Everything has been working perfectly until two days ago, now it can take 10 minutes to exit from vi. The following truss shows that the 3 line file that is sitting on the ZFS volume (/archives) took almost 15 minutes in fdsync. For those of you inside of Sun you can see the web page of all Sun Ray servers at http://canary.sfbay. My team has contacted engineering and we are waiting for more help. Any suggestions of what we might have hit would be appreciated.

itsm-mpk-2% zpool list
NAME     SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
canary    54G  38.9G  15.1G  72%  ONLINE  -

itsm-mpk-2% truss -o truss.out -dD vi /archives/junk
  2.8150   0.0005 write(1, "1B [ 1 M1B [ 3 2 B ~1B [".., 16) = 16
                  read(0, 0xFFBFD00F, 1)  (sleeping...)
  4.1164   1.3014 read(0, " Z", 1) = 1
  4.3322   0.2158 read(0, " Z", 1) = 1
  4.3329   0.0007 write(1, "1B [ 3 3 B", 5) = 5
  4.3332   0.0003 write(1, " " / a r c h i v e s / j".., 16) = 16
  4.3336   0.0004 stat64("/archives/junk", 0xFFBFCEF8) = 0
148.4689 144.1353 creat("/archives/junk", 0666) = 5
148.4703   0.0014 ioctl(2, TCSETSW, 0x00060C10) = 0
148.4708   0.0005 write(5, "\n 1 2 3 1 2 3\n", 8) = 8
*971.7021 823.2313 fdsync(5, FSYNC) = 0*
971.8102   0.1081 close(5) = 0
971.8108   0.0006 write(1, " 2   l i n e s ,   8  ".., 23) = 23
971.8113   0.0005 write(1, "\r\n", 2) = 2
971.8116   0.0003 write(1, "1B [ J", 3) = 3
971.8120   0.0004 write(1, "1B [ ? 1 l1B >", 7) = 7
971.8124   0.0004 ioctl(2, TCSETSW, 0x00060C10) = 0
971.8126   0.0002 ioctl(2, TCGETS, 0x000E8098) = 0
971.8128   0.0002 ioctl(0, I_STR, 0x000579F8) Err#22 EINVAL
971.8130   0.0002 ioctl(2, TIOCGPGRP, 0xFFBFCF8C) = 0
971.8131   0.0001 getpgid(0) = 9456
971.8134   0.0003 ioctl(2, TCSETSW, 0x000E8074) = 0
971.8137   0.0003 unlink("/var/tmp/ExKaaiEs") = 0
971.8141   0.0004 unlink("/var/tmp/RxLaaiEs") = 0
971.8144   0.0003 close(4) = 0
971.8146   0.0002 _exit(0)

I tried the same command on /var/tmp, a non-ZFS volume. There fdsync(4, FSYNC) takes 0.3 seconds:

  3.4120   0.0004 write(1, " " / v a r / t m p / j u".., 18) = 18
  3.4123   0.0003 stat64("/var/tmp/junk123", 0xFFBFCEF8) Err#2 ENOENT
  3.4128   0.0005 creat("/var/tmp/junk123", 0666) = 4
  3.4132   0.0004 ioctl(2, TCSETSW, 0x00060C10) = 0
  3.4135   0.0003 write(4, " a\n\n", 3) = 3
* 3.7586   0.3451 fdsync(4, FSYNC) = 0*
  3.7589   0.0003 close(4) = 0
  3.7592   0.0003 write(1, " [ N e w   f i l e ]  ".., 34) = 34
  3.7596   0.0004 write(1, "\r\n", 2) = 2
  3.7599   0.0003 write(1, "1B [ J", 3) = 3
  3.7602   0.0003 write(1, "1B [ ? 1 l1B >", 7) = 7
  3.7606   0.0004 ioctl(2, TCSETSW, 0x00060C10) = 0
  3.7608   0.0002 ioctl(2, TCGETS, 0x000E8098) = 0
  3.7610   0.0002 ioctl(0, I_STR, 0x000579F8) Err#22 EINVAL
  3.7612   0.0002 ioctl(2, TIOCGPGRP, 0xFFBFCF8C) = 0
  3.7614   0.0002 getpgid(0) = 28194
  3.7616   0.0002 ioctl(2, TCSETSW, 0x000E8074) = 0
  3.7619   0.0003 unlink("/var/tmp/ExR9aqk3") = 0
  3.7623   0.0004 _exit(0)

The canary monitor also watches our central server. The attached graph shows the CPU load for the last 7 days on the T2000; the RED line is %usr + %sys from a one-minute "sar -u". We mostly use just 10% of the Niagara to receive 110,000 files, 750 million lines per day. We also show all disk activity on the server by summing the r+w/s column from a "sar -d":

itsm-mpk-2% sar -d 3 3
SunOS itsm-mpk-2 5.10 Generic_118833-08 sun4v    06/17/2006

12:39:48   device     %busy   avque   r+w/s   blks/s   avwait   avserv
12:39:51   nfs1           0     0.0       0        0      0.0      0.0
           nfs60          0     0.0       0        0      0.0      0.0
           sd1          100    35.0     684     6131     27.8     23.4
           sd1,a          0     0.0       0        0      0.0      0.0
           sd1,b          0     0.0       0        0      0.0      0.0
           sd1,c          0     0.0       0        0      0.0      0.0
           sd1,d        100    35.0     684     6131     27.8     23.4
           sd2            0     0.0       0        0      0.0      0.0
           ohci0,bu       0     0.0       0        0      0.0      0.0
           ohci0,ct       0     0.0       0        0      0.0      0.0
           ohci0,in       0     0.0       0        0      0.0      0.0
           ohci0,is       0     0.0       0        0      0.0      0.0
           ohci0,to       0     0.0       0        0      0.0      0.0
           ohci1,bu       0     0.0       0        0      0.0      0.0
           ohci1,ct       0     0.0       0        0      0.0      0.0
           ohci1,in       0     0.0       0        0      0.0      0.0
           ohci1,is       0     0.0       0        0      0.0      0.0
           ohci1,to       0     0.0       0        0      0.0      0.0

The second attached graph shows the disk activity (again over the last 7 days). Thanks for any help with the fdsync.

PS: It is planned to make the canary monitor tool open source.

Sean Meighan
Mgr ITSM Engineering, Sun Microsystems, Inc.

[Attachments scrubbed by the list archive: itsm-mpk-2.SFBay.Sun.com_cpu.png and itsm-mpk-2.SFBay.Sun.com_sard.png, the CPU and disk activity graphs referenced above.]
Sean Meighan wrote:
> ... Everything has been working perfectly until two days ago, now it
> can take 10 minutes to exit from vi. The following truss shows that
> the 3 line file that is sitting on the ZFS volume (/archives) took
> almost 15 minutes in fdsync. For those of you inside of Sun you can
> see the web page of all Sun Ray servers at http://canary.sfbay. My
> team has contacted engineering and we are waiting for more help. Any
> suggestions of what we might have hit would be appreciated.

Have you tried using dtrace to follow the code inside the kernel? Or just to profile the kernel while this script is running?

Darren
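A rough, untested sketch of such a kernel profile (sampled kernel stacks while the slow fdsync is in flight; the 30-second window is arbitrary):

  # dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }' -n 'tick-30s { exit(0); }'

This aggregates kernel stacks at ~997 Hz and prints the hottest ones on exit, which should show where the time inside fdsync is going.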
Sean Meighan wrote:
> Hi all; i just joined the group. My team has created a tool to watch
> Solaris, Linux, Windows. It is non-root on both the client and the server.
> We currently are using it to watch the 750 Sun Ray servers inside of Sun.
> The basic thing it does is executes a 55 command shell script on the client
> and sends the 1000 line output up to the central server. We do this every
> 10 minutes across the 750 machines. We have a Niagara T2000, 32x1000mhz
> server running Solaris 10 Generic_118833-08. We put ZFS on this box two
> months ago. We currently have 3.5 million files and 3 billion lines of
> ASCII sitting on the internal drive of the Niagara. The box runs less than
> 20% load. Everything has been working perfectly until two days ago, now it
> can take 10 minutes to exit from vi.

Hi Sean,

May I ask what happened between 13/6 and 14/6? (Forgive my European bias in dates.)

Without getting into the zfs details: *what changes occurred in the system* around that time? That is about two days before the "two days ago", which I assume refers to 15/6.

We don't need any statistical process analysis here, but just from looking at the graphs I would say that until the end of the day on 13/6, the system shows very regular activity in cpu *and* disk. The assertion cpu > users is true for most of this period, with only a few exceptions. After 14/6, this definitely isn't.

* At 18:00 13/6 the baseline for both cpu and disk drops visibly.
* Around 08:00 14/6 the baseline jumps back up to previous levels.
* Between 14:00 and 16:00, baseline disk activity reaches new heights for the week. Variance is quite low during this time.
* At the same time, CPU load rises steadily, with little variation, until halfway through, when it starts a new, more chaotic behaviour that deviates significantly from previous patterns.
* After 16:00 14/6, disk activity also enters a new, more chaotic pattern with a higher variance.
* There are a couple of polling gaps on 15/6 and 16/6, which was after you realised something was happening.

The line graph has some disadvantages - I'd like to see a scatterplot, if possible with a log scale from 1 to 100 on the Y-axis... I don't have access to staroffice and the stats-plugin anymore, don't know if you have, but for data analysis they could be helpful.

Cheers,
Henk

But first check what actually happened between 13 and 14 June. Were new h/w resources added? New s/w installed or accessed?

Cheers,
Henk Langeveld
Sean Meighan schrieb:
> The box runs less than 20% load. Everything has been working perfectly
> until two days ago, now it can take 10 minutes to exit from vi. The
> following truss shows that the 3 line file that is sitting on the ZFS
> volume (/archives) took almost 15 minutes in fdsync.

/me have similar observations. With virtually no other activity on the box, exiting from vi (:wq) sometimes takes 10-60 seconds. Trussing the process always shows vi is waiting in fdsync(). This happened on at least two different machines:

- Sun E3500 with ZFS on A5200 disks (snv_37, almost all of the time a large
  delay (10-60 secs.) when exiting from vi)
- Solaris/amd64 with ZFS on SATA disks (snv_39, but only few delays, usually
  not longer than 10 secs.)

Daniel
Daniel Rock wrote:
> Sean Meighan schrieb:
>
>> The box runs less than 20% load. Everything has been working perfectly
>> until two days ago, now it can take 10 minutes to exit from vi. The
>> following truss shows that the 3 line file that is sitting on the ZFS
>> volume (/archives) took almost 15 minutes in fdsync.
>
> /me have similar observations. With virtual no other activity on the box
> exiting from vi (:wq) sometimes takes 10-60 seconds. Trussing the
> process always shows vi is waiting in fdsync(). This happened on at
> least two different machines:
> - Sun E3500 with ZFS on A5200 disks (snv_37, almost all of the time large
>   delay (10-60 secs.) when exiting from vi)
> - Solaris/amd64 with ZFS on SATA disks (snv_39, but only few delays,
>   usually not longer than 10 secs.)

I think you may be tripping over:

CR 6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS
(see http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6413510)

It is apparently because ZFS currently syncs the entire filesystem when asked to sync just one file.

Dana
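If it helps to confirm that theory, an untested sketch of a way to watch fdsync latency system-wide while the archive writes are landing:

  # dtrace -n 'syscall::fdsync:entry { self->ts = timestamp; }' \
           -n 'syscall::fdsync:return /self->ts/ { @["fdsync (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'

Ctrl-C prints a latency distribution; if the whole-filesystem sync is the culprit, fdsync times should blow up whenever the 10-minute collection run is writing into the same ZFS filesystem.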
ZFS engineering got back to us today and said the following:

  In addition to 6404018 there are a couple of other performance bottlenecks:

  6413510 zfs: writing to ZFS filesystem slows down fsync() on other files
          in the same FS
  6429205 each zpool needs to monitor its throughput and throttle heavy
          writers

  joining together and causing the slowdown in this system itsm-mpk-2.sfbay.

  6404018/6413510 is caused because of the way the ZFS logging records are
  handled. Disabling ZFS logging (add 'set zfs:zil_disable=1' to /etc/system)
  might give you relief from the 'vi' slowdown problem.

  6429205 is more likely to hit hard on machines with lots of memory, CPUs
  and weak storage. Which is true in this case. One disk is not able to
  handle the IO load generated by the 32 virtual CPUs. As a workaround can
  you try adding one or two disks to distribute the load.

  Yes. There is no fix/patch for these BUGs yet. We would recommend trying
  the workarounds until there is a fix available for these BUGs.

What is the downside of disabling logging records? This is a production machine and we are now affecting a fairly large population.

thanks
sean

Daniel Rock wrote:
> Sean Meighan schrieb:
>
>> The box runs less than 20% load. Everything has been working
>> perfectly until two days ago, now it can take 10 minutes to exit from
>> vi. The following truss shows that the 3 line file that is sitting on
>> the ZFS volume (/archives) took almost 15 minutes in fdsync.
>
> /me have similar observations. With virtual no other activity on the
> box exiting from vi (:wq) sometimes takes 10-60 seconds. Trussing the
> process always shows vi is waiting in fdsync(). This happened on at
> least two different machines:
> - Sun E3500 with ZFS on A5200 disks (snv_37, almost all of the time large
>   delay (10-60 secs.) when exiting from vi)
> - Solaris/amd64 with ZFS on SATA disks (snv_39, but only few delays,
>   usually not longer than 10 secs.)

Sean Meighan
Mgr ITSM Engineering, Sun Microsystems, Inc.
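For the record, the two workarounds engineering suggested would look roughly like this (untested sketch; the device names are made up, the /etc/system change needs a reboot, and note the caveats about zil_disable later in this thread):

  # echo 'set zfs:zil_disable=1' >> /etc/system     (then reboot)
  # zpool add canary c1t1d0 c1t2d0                  (spread the pool over more spindles)

Also worth remembering that "zpool add" permanently grows the pool; at this point devices added to a pool cannot be removed again later.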
Sean,

I'm not sure yet that the bugs below are responsible for this, but I don't know what is either. 15 minutes to do an fdsync is way outside the slowdown usually seen. The footprint for 6413510 is that when a huge amount of data is being written non-synchronously and a fsync comes in for the same filesystem, then all the non-synchronous data is also forced out synchronously. So is there a lot of data being written during the vi?

Also you say "Everything has been working perfectly until two days ago, now it can take 10 minutes to exit from vi". So what happened two days ago?

Finally, I would not recommend disabling logging. That switch was only intended for internal use. Applications that rely on POSIX synchronous semantics will not get what they asked for.

Neil.

Sean Meighan wrote On 06/19/06 17:48,:
> ZFS engineering got back to us today and said the following:
>
> In addition to 6404018 there are couple of other performance bottle necks :
>
> 6413510 zfs: writing to ZFS filesystem slows down fsync() on other files
>         in the same FS
> 6429205 each zpool needs to monitor it's throughput and throttle heavy
>         writers
>
> joining together and causing the slowdown in this system itsm-mpk-2.sfbay.
>
> 6404018/6413510 is caused because of the way the ZFS logging records are
> handled. Disabling ZFS logging (add 'set zfs:zil_disable=1' to
> /etc/system) might give you a relief from the 'vi' slowdown problem.
>
> 6429205 is more lively to hit hard machines with lots of memory, CPUs
> and a weak storage. Which is true in this case. One disk is not able to
> handle the IO load generated by the 32 virtual CPUs. As a workaround can
> you try adding one or two disks to distribute the load.
>
> Yes. There is no fix/patch for these BUGs yet. We would recommend to try
> the workarounds till there is a fix available for these BUGs.
>
> what is the downside of disabling loggin records? this is a production
> machine and we are now affecting a fairly large population.
>
> thanks
> sean
>
> [...]

--
Neil
> 15 minutes to do a fdsync is way outside the slowdown usually seen.
> The footprint for 6413510 is that when a huge amount of
> data is being written non synchronously and a fsync comes in for the
> same filesystem then all the non-synchronous data is also forced out
> synchronously. So is there a lot of data being written during the vi?

vi will write the whole file in 4K chunks and fsync it (based on a single experiment).

So for a large-file vi, on quit, we have lots of data to sync in and of itself. But because of 6413510 we potentially have to sync lots of other data written by other applications.

Now take a Niagara with lots of available CPUs and lots of free memory (32GB maybe?) running some 'tar x' in parallel. A huge chunk of the 32GB can end up dirty. I say too much so because of lack of throttling:

  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205
  6429205 each zpool needs to monitor its throughput and throttle heavy writers

Then vi :q; fsyncs; and all of the pending data must sync. So we have extra data to sync because of:

  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6413510
  zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS

Furthermore, we can be slowed by this:

  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6440499
  zil should avoid txg_wait_synced() and use dmu_sync() to issue parallel IOs...

Note: 6440499 is now fixed in the gate.

And finally all this data goes to a single disk. Worse, a slice of a disk. Since it's just a slice, ZFS can't enable the write cache. Then if there is no tag queue (is there?) we will handle everything one I/O at a time. If it's a SATA drive we have other issues...

I think we've hit it all here. So can this lead to a 15 min fsync? I can't swear - actually I won't be convinced myself before I convince you - but we do have things to chew on already.

Do I recall that this is about a 1GB file in vi? :wq-uitting out of a 1GB vi session on a 50MB/sec disk will take 20 sec when everything hums and there is no other traffic involved. With no write cache / no tag queue, maybe 10X more.

-r
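That single experiment is easy to repeat; an untested sketch, against any scratch file on the affected filesystem:

  % truss -t write,fdsync -o /tmp/vi.truss vi /archives/somefile
  % grep -c 'write(' /tmp/vi.truss

The truss log should show a run of 4K write()s followed by a single fdsync() on :wq, which is the call that stalls here.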
The vi we were doing was on a 2 line file. If you just vi a new file, add one line and exit, it would take 15 minutes in fdsync. On recommendation of a workaround we set

  set zfs:zil_disable=1

After the reboot the fdsync is now < 0.1 seconds. Now I have no idea if it was this setting or the fact that we went through a reboot. Whatever the root cause, we are now back to a well behaved file system.

thanks
sean

Roch wrote:
> 15 minutes to do a fdsync is way outside the slowdown usually seen.
> The footprint for 6413510 is that when a huge amount of
> data is being written non synchronously and a fsync comes in for the
> same filesystem then all the non-synchronous data is also forced out
> synchronously. So is there a lot of data being written during the vi?
>
> vi will write the whole file (in 4K) chunks and fsync it.
> (based on a single experiment).
>
> [...]
>
> Do I recall that this is about a 1GB file in vi ?
> :wq-uitting out of a 1 GB vi session on a 50MB/sec disk will
> take 20sec when everything hums and there are no other
> traffic involved. With no write cache / no tag queue , maybe
> 10X more.
>
> -r

Sean Meighan
Mgr ITSM Engineering, Sun Microsystems, Inc.
Roch wrote On 06/21/06 07:31,:
> 15 minutes to do a fdsync is way outside the slowdown usually seen.
> The footprint for 6413510 is that when a huge amount of
> data is being written non synchronously and a fsync comes in for the
> same filesystem then all the non-synchronous data is also forced out
> synchronously. So is there a lot of data being written during the vi?
>
> vi will write the whole file (in 4K) chunks and fsync it.
> (based on a single experiment).

Sean kindly gave me access to the system and so far I have reproduced the problem. It just requires an fsync on a file with 1 byte, and for me it takes 10 minutes to fsync! I have run a few D scripts but have yet to make much more progress. I do see that the zfs version is fairly old, so we may be chasing an old bug, or perhaps this really is an extreme version of 6413510, as there is often 3MB of data being collected and written to the pool, and probably to the same fs.

Neil
Well this does look more and more like a duplicate of:

6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS

Neil
Sean Meighan writes:
> The vi we were doing was a 2 line file. If you just vi a new file, add
> one line and exit it would take 15 minutes in fdsynch. On recommendation
> of a workaround we set
>
>   set zfs:zil_disable=1
>
> after the reboot the fdsynch is now < 0.1 seconds. Now I have no idea if
> it was this setting or the fact that we went through a reboot. Whatever
> the root cause we are now back to a well behaved file system.

Well behaved... in appearance only!

Maybe it's nice to validate the hypothesis, but you should not run with this option set, ever. It disables O_DSYNC and fsync() and I don't know what else.

Bad idea, bad.

-r
Torrey McMahon
2006-Jun-21 16:29 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Roch wrote:
> Sean Meighan writes:
> > The vi we were doing was a 2 line file. If you just vi a new file, add
> > one line and exit it would take 15 minutes in fdsynch. On recommendation
> > of a workaround we set
> >
> >   set zfs:zil_disable=1
> >
> > after the reboot the fdsynch is now < 0.1 seconds. Now I have no idea if
> > it was this setting or the fact that we went through a reboot. Whatever
> > the root cause we are now back to a well behaved file system.
>
> well behaved... In appearance only!
>
> Maybe it's nice to validate hypothesis but you should not
> run with this option set, ever; it disables O_DSYNC and
> fsync() and I don't know what else.
>
> Bad idea, bad.

Why is this option available then? (Yes, that's a loaded question.)
Torrey McMahon wrote On 06/21/06 10:29,:
> Roch wrote:
>
>> Maybe it's nice to validate hypothesis but you should not
>> run with this option set, ever; it disables O_DSYNC and
>> fsync() and I don't know what else.
>>
>> Bad idea, bad.
>
> Why is this option available then? (Yes, that's a loaded question.)

I wouldn't call it an option, but an internal debugging switch that I originally added to allow progress when initially integrating the ZIL. As Roch says it really shouldn't ever be set (as it does negate POSIX synchronous semantics). Nor should it be mentioned to a customer. In fact I'm inclined to now remove it - however it does still have a use, as it helped root cause this problem.

Neil
Robert Milkowski
2006-Jun-21 17:09 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Hello Neil,

Wednesday, June 21, 2006, 6:41:50 PM, you wrote:

NP> Torrey McMahon wrote On 06/21/06 10:29,:
>> Roch wrote:
>>
>>> Maybe it's nice to validate hypothesis but you should not
>>> run with this option set, ever; it disables O_DSYNC and
>>> fsync() and I don't know what else.
>>>
>>> Bad idea, bad.
>>
>> Why is this option available then? (Yes, that's a loaded question.)

NP> I wouldn't call it an option, but an internal debugging switch that I
NP> originally added to allow progress when initially integrating the ZIL.
NP> As Roch says it really shouldn't ever be set (as it does negate POSIX
NP> synchronous semantics). Nor should it be mentioned to a customer.
NP> In fact I'm inclined to now remove it - however it does still have a use
NP> as it helped root cause this problem.

Isn't it similar to unsupported fastfs for ufs? I think it could be useful in some cases after all.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Nicolas Williams
2006-Jun-21 17:10 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Wed, Jun 21, 2006 at 10:41:50AM -0600, Neil Perrin wrote:
> > Why is this option available then? (Yes, that's a loaded question.)
>
> I wouldn't call it an option, but an internal debugging switch that I
> originally added to allow progress when initially integrating the ZIL.
> As Roch says it really shouldn't ever be set (as it does negate POSIX
> synchronous semantics). Nor should it be mentioned to a customer.
> In fact I'm inclined to now remove it - however it does still have a use
> as it helped root cause this problem.

Rename it to "zil_disable_danger_will_robinson" :)
Torrey McMahon
2006-Jun-21 17:19 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Nicolas Williams wrote:
> On Wed, Jun 21, 2006 at 10:41:50AM -0600, Neil Perrin wrote:
>>> Why is this option available then? (Yes, that's a loaded question.)
>>
>> I wouldn't call it an option, but an internal debugging switch that I
>> originally added to allow progress when initially integrating the ZIL.
>> As Roch says it really shouldn't ever be set (as it does negate POSIX
>> synchronous semantics). Nor should it be mentioned to a customer.
>> In fact I'm inclined to now remove it - however it does still have a use
>> as it helped root cause this problem.
>
> Rename it to "zil_disable_danger_will_robinson"

The sad truth is that debugging bits tend to survive into production, and then we get escalations that go something like, "I set this variable in /etc/system and now I'm {getting data corruption, weird behavior, an odd rash, ...}"

The fewer the better, imho. If it can be removed, great. If not, then maybe something for the tunables guide.
Robert Milkowski wrote On 06/21/06 11:09,:
> Hello Neil,
>
>>> Why is this option available then? (Yes, that's a loaded question.)
>
> NP> I wouldn't call it an option, but an internal debugging switch that I
> NP> originally added to allow progress when initially integrating the ZIL.
> NP> As Roch says it really shouldn't ever be set (as it does negate POSIX
> NP> synchronous semantics). Nor should it be mentioned to a customer.
> NP> In fact I'm inclined to now remove it - however it does still have a use
> NP> as it helped root cause this problem.
>
> Isn't it similar to unsupported fastfs for ufs?

It is similar in the sense that it speeds up the file system. Using fastfs can be much more dangerous though, as it can lead to a badly corrupted file system because metadata writes are delayed and reordered. Whereas disabling the ZIL does not affect the integrity of the fs. The transaction group model of ZFS gives consistency in the event of a crash/power fail. However, any data that was promised to be on stable storage may not be unless the transaction group committed (an operation that is started every 5s).

We once had plans to add a mount option to allow the admin to control the ZIL. Here's a brief section of the RFE (6280630):

  sync={deferred,standard,forced}

    Controls synchronous semantics for the dataset.

    When set to 'standard' (the default), synchronous operations
    such as fsync(3C) behave precisely as defined in fcntl.h(3HEAD).

    When set to 'deferred', requests for synchronous semantics
    are ignored. However, ZFS still guarantees that ordering
    is preserved -- that is, consecutive operations reach stable
    storage in order. (If a thread performs operation A followed
    by operation B, then the moment that B reaches stable storage,
    A is guaranteed to be on stable storage as well.) ZFS also
    guarantees that all operations will be scheduled for write to
    stable storage within a few seconds, so that an unexpected
    power loss only takes the last few seconds of change with it.

    When set to 'forced', all operations become synchronous.
    No operation will return until all previous operations
    have been committed to stable storage. This option can be
    useful if an application is found to depend on synchronous
    semantics without actually requesting them; otherwise, it
    will just make everything slow, and is not recommended.

Of course we would need to stress the dangers of setting 'deferred'. What do you guys think?

Neil.
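If that RFE were implemented as a regular dataset property, usage would presumably look something like the sketch below (purely hypothetical syntax - the property does not exist today, and the dataset name is made up):

  # zfs set sync=deferred tank/build
  ... run the batch workload ...
  # zfs set sync=standard tank/build

with the switch back to 'standard' (or an explicit sync of the filesystem) marking the point after which the data is known to be on stable storage.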
Bill Sommerfeld
2006-Jun-21 19:03 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?

I can think of a use case for "deferred": improving the efficiency of a large mega-"transaction"/batch job such as a nightly build.

You create an initially empty or cloned dedicated filesystem for the build, and start it off, and won't look inside until it completes. If the build machine crashes in the middle of the build you're going to nuke it all and start over, because that's lower risk than assuming you can pick up where it left off.

Now, it happens that a bunch of tools used during a build invoke fsync. But in the context of a full nightly build that effort is wasted. All you need is one big "sync everything" at the very end, either by using a command like sync or lockfs -f, or as a side effect of reverting from sync=deferred to sync=standard.

- Bill
Torrey McMahon
2006-Jun-21 19:44 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Neil Perrin wrote:
>
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?

That's the key: be very explicit about what the option does and the side effects.
Neil Perrin wrote:
> Robert Milkowski wrote On 06/21/06 11:09,:
>> Isn't it similar to unsupported fastfs for ufs?
>
> It is similar in the sense that it speeds up the file system.
> Using fastfs can be much more dangerous though as it can lead
> to a badly corrupted file system as writing meta data is delayed
> and written out of order. Whereas disabling the ZIL does not affect
> the integrity of the fs. The transaction group model of ZFS gives
> consistency in the event of a crash/power fail. However, any data that
> was promised to be on stable storage may not be unless the transaction
> group committed (an operation that is started every 5s).
>
> We once had plans to add a mount option to allow the admin
> to control the ZIL. Here's a brief section of the RFE (6280630):
>
>   sync={deferred,standard,forced}
>
> [...]
>
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?
>
> Neil.

Scares me, and it seems we should wait until people are demanding it and we *have* to do it (if that time ever comes) - that is, until we can't squeeze any more performance gain out of the 'standard' method.

If problems do occur because of 'deferred' mode, once I wrap up zpool history we'll have the fact that they set this logged to disk.

eric
Neil,

I think it might be wise to look at this problem from the perspective of an application (e.g. a simple database) designer, taking into account all the new things that Solaris ZFS provides.

In the case of ZFS the designer does not have to worry about consistency of the on-disk file system format, but only about "has my data been committed to disk (or to NVRAM if there is one)". Depending on the problem the designer tries to address, the concern might be either the total write throughput, in which case the designer might love the "deferred" option, or the ability to sync file data to stable storage and the latency of that operation. Considering the flexibility of file system creation in ZFS, I could imagine the use of multiple file systems with different mount options for different types of files.

All in all, though, the question is whether a set of the POSIX calls with the semantics defined through the mount options gives programmers (or application designers) enough flexibility to address the most common issues in high-level application scenarios in a simple and productive way. If so, which of these different sync options are useful or needed.

-- Olaf

> It is similar in the sense that it speeds up the file system.
> Using fastfs can be much more dangerous though as it can lead
> to a badly corrupted file system as writing meta data is delayed
> and written out of order. Whereas disabling the ZIL does not affect
> the integrity of the fs. The transaction group model of ZFS gives
> consistency in the event of a crash/power fail. However, any data that
> was promised to be on stable storage may not be unless the transaction
> group committed (an operation that is started every 5s).
>
> We once had plans to add a mount option to allow the admin
> to control the ZIL. Here's a brief section of the RFE (6280630):
>
>   sync={deferred,standard,forced}
>
> [...]
>
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?
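To make that concrete, such a layout might look like the sketch below - again using the hypothetical sync property from the RFE and made-up dataset names - giving the database files strict semantics while scratch/staging data trades durability for throughput:

  # zfs create tank/db
  # zfs create tank/db/data              (leave sync=standard: fsync/O_DSYNC honoured)
  # zfs create tank/db/staging
  # zfs set sync=deferred tank/db/staging    (bulk loads that can be redone after a crash)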
Jason Ozolins
2006-Jun-22 03:57 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld wrote:
> On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
>
>> Of course we would need to stress the dangers of setting 'deferred'.
>> What do you guys think?
>
> I can think of a use case for "deferred": improving the efficiency of a
> large mega-"transaction"/batch job such as a nightly build.
>
> You create an initially empty or cloned dedicated filesystem for the
> build, and start it off, and won't look inside until it completes. If
> the build machine crashes in the middle of the build you're going to
> nuke it all and start over because that's lower risk than assuming you
> can pick up where it left off.
>
> now, it happens that a bunch of tools used during a build invoke fsync.
> But in the context of a full nightly build that effort is wasted. All
> you need is one big "sync everything" at the very end, either by using a
> command like sync or lockfs -f, or as a side effect of reverting from
> sync=deferred to sync=standard.

Can I give support for this use case? Or does it take someone like Casper Dik with 'fastfs' to come along later and provide a utility that lets people make the filesystem do what they want it to?

[Still annoyed that it took me so long to find out about fastfs - hell, the Solaris 8 or 9 OS installation process was using the same IOCTL as fastfs uses, but for some reason end users still have to find fastfs out on the Net somewhere instead of getting it with the OS.]

If the ZFS docs state why it's not for general use, then what's to separate this from the zillion other ways that a cavalier sysadmin can bork their data (or indeed their whole machine)? Otherwise, why even let people create a striped zpool vdev without redundancy - it's just an accident waiting to happen, right? We must save people from themselves! Think of the children! ;-)

-Jason  =:^/

--
Jason.Ozolins at anu.edu.au        ANU Supercomputer Facility, APAC Grid Program
                                   Australian National University, Canberra
How about the 'deferred' option being on a leased basis, with a deadline to revert to normal behavior; at most 24hrs at a time. Console output every time the option is enabled.

-r

Torrey McMahon writes:
> Neil Perrin wrote:
>>
>> Of course we would need to stress the dangers of setting 'deferred'.
>> What do you guys think?
>
> That's the key: Be very explicit about what the option does and the side
> effects.
Darren J Moffat
2006-Jun-22 08:56 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld wrote:
> On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
>> Of course we would need to stress the dangers of setting 'deferred'.
>> What do you guys think?
>
> I can think of a use case for "deferred": improving the efficiency of a
> large mega-"transaction"/batch job such as a nightly build.

Yum Yum!!

We could even build this into nightly(1) once we have user delegation to create clones.

nightly(1) would zfs clone, zfs set reservation=, zfs set sync=deferred, and when it is done release the reservation, unset deferred and snapshot.

When can we have it?

--
Darren J Moffat
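A rough sketch of what that nightly(1) wrapper might do, assuming the hypothetical sync property from the RFE plus delegated clone rights, with made-up dataset names:

  # zfs clone tank/ws/gate@latest tank/ws/nightly-build
  # zfs set reservation=20G tank/ws/nightly-build
  # zfs set sync=deferred tank/ws/nightly-build
  ... run the build ...
  # zfs set sync=standard tank/ws/nightly-build    (or sync the filesystem explicitly)
  # zfs set reservation=none tank/ws/nightly-build
  # zfs snapshot tank/ws/nightly-build@done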
Dana H. Myers
2006-Jun-22 09:20 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Darren J Moffat wrote:
> Bill Sommerfeld wrote:
>> On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
>>> Of course we would need to stress the dangers of setting 'deferred'.
>>> What do you guys think?
>>
>> I can think of a use case for "deferred": improving the efficiency of a
>> large mega-"transaction"/batch job such as a nightly build.
>
> Yum Yum!!
>
> We could even build this into nightly(1) once we have user delegation to
> create clones.
>
> nightly(1) would zfs clone, zfs set reservation=, zfs set sync=deferred,
> and when it is done release the reservation, unset deferred and snapshot.
>
> When can we have it?

Before we get too far down that path, has anyone timed a nightly with and without zil_disable'd?

Dana
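A crude way to get that number (untested sketch; the nightly path and env file below are just examples from a typical ON build setup) would be to run the same build twice under ptime, once normally and once with 'set zfs:zil_disable=1' in /etc/system (reboot in between), on an otherwise idle box:

  # ptime /opt/onbld/bin/nightly /export/ws/nightly.env

and then compare the real/user/sys times of the two runs.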
Robert Milkowski
2006-Jun-22 12:36 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Hello Roch,

Thursday, June 22, 2006, 9:55:41 AM, you wrote:

R> How about the 'deferred' option be on a leased basis with a
R> deadline to revert to normal behavior; at most 24hrs at a
R> time. Console output every time the option is enabled.

I really hate it when tools try to be more clever than sysadmins. Generating some kind of warning - sure, why not.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Bill Sommerfeld
2006-Jun-22 14:05 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, 2006-06-22 at 03:55, Roch wrote:
> How about the 'deferred' option be on a leased basis with a
> deadline to revert to normal behavior; at most 24hrs at a
> time.

Why?

> Console output every time the option is enabled.

In general, no. Error messages to the console should be reserved for truly frightening events, and this simply isn't one of them.

- Bill
Well, I should weigh in here. I have been using ZFS with an iSCSI backend and an NFS front end to my clients. Until B41 (not sure what fixed this) I was getting 20KB/sec for RAID-Z and 200KB/sec for plain ZFS on large iSCSI LUNs (non-RAID-Z) when I was receiving many small writes, such as untarring a Linux or OpenSolaris tree, or artificially, a copy of 6250 8k files. It turned out that NFS would issue 3 fsyncs on each write, and my performance degraded terribly from my normal 20MB+/sec writes to the backend iSCSI storage.

Now, a parallel test using NetApps shows no performance drop, but that's because of the NVRAM-backed storage there, and a test against the same iSCSI targets using Linux with XFS and the NFS server implementation there gave me 1.25MB/sec writes. I was about to throw in the towel and deem ZFS/NFS as unusable until B41 came along and at least gave me 1.25MB/sec.

This option, with all its caveats, would be ideal on various NFS-provided filesystems (large cache directories for cluster nodes, tmp space from my pools, etc.) to get performance characteristics similar to a NetApp Filer. If I can provide for that stable storage to a high degree, or have only NVRAM-based storage, this could be a big win for ZFS, if nothing else than for the RFPs that require benchmarks/bakeoffs against a NetApp showing it can perform just as fast, caveats and all.

On 6/21/06, Olaf Manczak <Olaf.Manczak at sun.com> wrote:
> Neil,
>
> I think it might be wise to look at this problem from the perspective
> of an application (e.g. a simple database) designer taking into account
> all the new things that Solaris ZFS provides.
>
> In case of ZFS the designer does not have to worry about consistency
> of the on-disk file system format but only about "has my data been
> committed either to disk (or to NVRAM if there is one)". Depending on
> the problem the designer tries to address it might be either the
> total write throughput, in which case the designer might love the
> "deferred" option, or the ability of sync file data to stable storage
> and the latency of this operation. Considering flexibility of the
> file system creation in ZFS I could imagine use of multiple file
> systems with different mount options for different types of files.
>
> [...]
Bill Sommerfeld writes:
> On Thu, 2006-06-22 at 03:55, Roch wrote:
> > How about the 'deferred' option be on a leased basis with a
> > deadline to revert to normal behavior; at most 24hrs at a
> > time.
> why?

I'll trust your judgement over mine on this, so I won't press. But it was mentioned that this would be useful for implementing a time-bounded huge meta-transaction such as a build. Given that we eventually do want to have a point where we know that data is on stable storage, I'd figure we could say upfront what the time scale is.

Is there a sync command that targets individual FS?

-r
Bill Sommerfeld
2006-Jun-22 17:15 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, 2006-06-22 at 13:01, Roch wrote:
> Is there a sync command that targets individual FS ?

Yes. lockfs -f

- Bill
Darren J Moffat
2006-Jun-22 17:19 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld wrote:
> On Thu, 2006-06-22 at 13:01, Roch wrote:
>> Is there a sync command that targets individual FS ?
>
> Yes. lockfs -f

Does lockfs work with ZFS? The man page appears to indicate it is very UFS specific.

--
Darren J Moffat
Bill Sommerfeld
2006-Jun-22 17:32 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, 2006-06-22 at 13:19, Darren J Moffat wrote:
>> Yes. lockfs -f
>
> Does lockfs work with ZFS ? The man page appears to indicate it is very
> UFS specific.

All of lockfs does not. But, if truss is to be believed, the ioctl used by lockfs -f appears to. Or at least, it returns without error.

- Bill
Jonathan Adams
2006-Jun-22 17:39 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, Jun 22, 2006 at 06:19:20PM +0100, Darren J Moffat wrote:
> Bill Sommerfeld wrote:
> > On Thu, 2006-06-22 at 13:01, Roch wrote:
> >> Is there a sync command that targets individual FS ?
> >
> > Yes. lockfs -f
>
> Does lockfs work with ZFS ? The man page appears to indicate it is very
> UFS specific.

Well, it just ends up doing an ioctl(), which zfs recognizes:

# dtrace -n 'syscall::ioctl:entry/pid == $target/{self->on = 1}' \
    -n 'fbt:::/self->on/{}' -n 'syscall::ioctl:return/self->on/{self->on = 0}' \
    -F -c 'lockfs -f /aux1'
dtrace: description 'syscall::ioctl:entry' matched 1 probe
dtrace: description 'fbt:::' matched 44321 probes
dtrace: description 'syscall::ioctl:return' matched 1 probe
dtrace: pid 151072 has exited
CPU FUNCTION
  0  -> ioctl
  0    -> getf
  0      -> set_active_fd
  0      <- set_active_fd
  0    <- getf
  0    -> get_udatamodel
  0    <- get_udatamodel
  0    -> fop_ioctl
  0      -> zfs_ioctl
  0      <- zfs_ioctl
  0      -> zfs_sync
  0        -> zil_commit
  0        <- zil_commit
  0      <- zfs_sync
  0    <- fop_ioctl
  0    -> releasef
  0      -> clear_active_fd
  0      <- clear_active_fd
  0      -> cv_broadcast
  0      <- cv_broadcast
  0    <- releasef
  0  <- ioctl

So the sync happens.

Cheers,
- jonathan

--
Jonathan Adams, Solaris Kernel Development
As I recall, the zfs sync is, unlike UFS, synchronous. -r
Prabahar Jeyaram
2006-Jun-22 17:55 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Yep. ZFS supports the ioctl (_FIOFFS) which 'lockfs -f' issues.

--
Prabahar.

Darren J Moffat wrote:
> Bill Sommerfeld wrote:
>> On Thu, 2006-06-22 at 13:01, Roch wrote:
>>> Is there a sync command that targets individual FS ?
>>
>> Yes. lockfs -f
>
> Does lockfs work with ZFS ? The man page appears to indicate it is very
> UFS specific.
Jonathan Adams
2006-Jun-22 18:01 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, Jun 22, 2006 at 07:46:57PM +0200, Roch wrote:
>
> As I recall, the zfs sync is, unlike UFS, synchronous.

Uh, are you talking about sync(2), or lockfs -f? IIRC, lockfs -f is always synchronous.

Cheers,
- jonathan

--
Jonathan Adams, Solaris Kernel Development
Yes, lockfs works. It uses the ZIL - unless it's disabled, in which case it waits for all outstanding txgs to commit. The man page doesn't say lockfs is specific to UFS, but it does mention one specific UFS detail.

Darren J Moffat wrote On 06/22/06 11:19,:
> Bill Sommerfeld wrote:
>
>> On Thu, 2006-06-22 at 13:01, Roch wrote:
>>
>>> Is there a sync command that targets individual FS ?
>>
>> Yes. lockfs -f
>
> Does lockfs work with ZFS ? The man page appears to indicate it is very
> UFS specific.

--
Neil
Robert Milkowski
2006-Jun-28 21:52 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Hello Neil,

Wednesday, June 21, 2006, 8:15:54 PM, you wrote:

NP> It is similar in the sense that it speeds up the file system.
NP> Using fastfs can be much more dangerous though as it can lead
NP> to a badly corrupted file system as writing meta data is delayed
NP> and written out of order. Whereas disabling the ZIL does not affect
NP> the integrity of the fs. The transaction group model of ZFS gives
NP> consistency in the event of a crash/power fail. However, any data that
NP> was promised to be on stable storage may not be unless the transaction
NP> group committed (an operation that is started every 5s).
NP>
NP> We once had plans to add a mount option to allow the admin
NP> to control the ZIL. Here's a brief section of the RFE (6280630):
NP>
NP>   sync={deferred,standard,forced}
NP>
NP> [...]
NP>
NP> Of course we would need to stress the dangers of setting 'deferred'.
NP> What do you guys think?

I think it would be really useful. I have found myself many times in situations where such features (like fastfs) were my last-resort help.

The same with txg_time - in some cases tuning it could probably be useful. Instead of playing with mdb it would be much better to put it into zpool/zfs or another utility (and if possible make it per-fs, not per-host).

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
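For reference, the "playing with mdb" mentioned above is along these lines (a sketch only - txg_time is an undocumented internal whose name and units may change, and the change does not survive a reboot):

  # echo 'txg_time/D' | mdb -k           (print the current value, in seconds)
  # echo 'txg_time/W 0t1' | mdb -kw      (set it to 1 second on the live kernel)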
Robert Milkowski wrote On 06/28/06 15:52,:
> Hello Neil,
>
> NP> We once had plans to add a mount option to allow the admin
> NP> to control the ZIL. Here's a brief section of the RFE (6280630):
> NP>
> NP>   sync={deferred,standard,forced}
> NP>
> NP> [...]
> NP>
> NP> Of course we would need to stress the dangers of setting 'deferred'.
> NP> What do you guys think?
>
> I think it would be really useful.
> I found myself many times in situation that such features (like
> fastfs) were my last resort help.

The overwhelming consensus was that it would be useful. So I'll go ahead and put that on my to-do list.

> The same with txg_time - in some cases tuning it could probably be
> useful. Instead of playing with mdb it would be much better put into
> zpool/zfs or other util (and if possible made per fs not per host).

This one I'm less sure about. I have certainly tuned txg_time myself to force certain situations, but I wouldn't be happy exposing the inner workings of ZFS - which may well change.

Neil