Iwan Aucamp
2012-Jul-02 10:33 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
I'm interested in some more detail on how the ZFS intent log behaves for updates done via a memory mapped file - i.e. will the ZIL log updates done to an mmap'd file or not?
Bob Friesenhahn
2012-Jul-02 20:32 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Mon, 2 Jul 2012, Iwan Aucamp wrote:
> I'm interested in some more detail on how the ZFS intent log behaves for
> updates done via a memory mapped file - i.e. will the ZIL log updates
> done to an mmap'd file or not?

I would expect these writes to go into the intent log unless msync(2) is used on the mapping with the MS_SYNC option.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
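[To make that concrete, a minimal sketch - the file path and the fixed 4 KB mapping size are hypothetical - of an mmap'd update that is only forced to stable storage by the msync(2)/MS_SYNC call:]

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
            /* Hypothetical file, assumed to be at least 4096 bytes long. */
            int fd = open("/tank/data/example.dat", O_RDWR);
            if (fd == -1)
                    return (1);

            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return (1);

            memcpy(p, "hello", 5);  /* dirty the page via the mapping */

            /*
             * Without this call the kernel may keep the dirty page in
             * RAM indefinitely; MS_SYNC blocks until the update has
             * been written through to the filesystem.
             */
            if (msync(p, 4096, MS_SYNC) == -1)
                    return (1);

            (void) munmap(p, 4096);
            (void) close(fd);
            return (0);
    }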
Nico Williams
2012-Jul-02 22:00 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Mon, Jul 2, 2012 at 3:32 PM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 2 Jul 2012, Iwan Aucamp wrote:
>> I'm interested in some more detail on how the ZFS intent log behaves for
>> updates done via a memory mapped file - i.e. will the ZIL log updates
>> done to an mmap'd file or not?
>
> I would expect these writes to go into the intent log unless msync(2) is
> used on the mapping with the MS_SYNC option.

You can't count on any writes to mmap(2)ed files hitting disk until you msync(2) with MS_SYNC. The system should want to wait as long as possible before committing any mmap(2)ed file writes to disk. Conversely, you can't expect that no writes will hit disk until you msync(2) or munmap(2).

Nico
--
James Litchfield
2012-Jul-03 14:48 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
Inline.

On 07/02/12 15:00, Nico Williams wrote:
> On Mon, Jul 2, 2012 at 3:32 PM, Bob Friesenhahn
> <bfriesen at simple.dallas.tx.us> wrote:
>> On Mon, 2 Jul 2012, Iwan Aucamp wrote:
>>> I'm interested in some more detail on how the ZFS intent log behaves for
>>> updates done via a memory mapped file - i.e. will the ZIL log updates
>>> done to an mmap'd file or not?
>>
>> I would expect these writes to go into the intent log unless msync(2) is
>> used on the mapping with the MS_SYNC option.
>
> You can't count on any writes to mmap(2)ed files hitting disk until
> you msync(2) with MS_SYNC. The system should want to wait as long as
> possible before committing any mmap(2)ed file writes to disk.
> Conversely, you can't expect that no writes will hit disk until you
> msync(2) or munmap(2).

Driven by fsflush, which will scan memory (in chunks) looking for dirty, unlocked, non-kernel pages to flush to disk.
Nico Williams
2012-Jul-03 15:47 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Tue, Jul 3, 2012 at 9:48 AM, James Litchfield
<jim.litchfield at oracle.com> wrote:
> On 07/02/12 15:00, Nico Williams wrote:
>> You can't count on any writes to mmap(2)ed files hitting disk until
>> you msync(2) with MS_SYNC. The system should want to wait as long as
>> possible before committing any mmap(2)ed file writes to disk.
>> Conversely, you can't expect that no writes will hit disk until you
>> msync(2) or munmap(2).
>
> Driven by fsflush, which will scan memory (in chunks) looking for dirty,
> unlocked, non-kernel pages to flush to disk.

Right, but one just cannot count on that -- it's not part of the API specification.

Nico
--
James Litchfield
2012-Jul-03 16:15 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
Agreed - msync/munmap is the only guarantee.

On 07/03/12 08:47 AM, Nico Williams wrote:
> On Tue, Jul 3, 2012 at 9:48 AM, James Litchfield
> <jim.litchfield at oracle.com> wrote:
>> On 07/02/12 15:00, Nico Williams wrote:
>>> You can't count on any writes to mmap(2)ed files hitting disk until
>>> you msync(2) with MS_SYNC. The system should want to wait as long as
>>> possible before committing any mmap(2)ed file writes to disk.
>>> Conversely, you can't expect that no writes will hit disk until you
>>> msync(2) or munmap(2).
>>
>> Driven by fsflush, which will scan memory (in chunks) looking for dirty,
>> unlocked, non-kernel pages to flush to disk.
>
> Right, but one just cannot count on that -- it's not part of the API
> specification.
Bob Friesenhahn
2012-Jul-04 16:14 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Tue, 3 Jul 2012, James Litchfield wrote:
> Agreed - msync/munmap is the only guarantee.

I don't see that the munmap definition assures that anything is written to "disk". The system is free to buffer the data in RAM as long as it likes without writing anything at all.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Nico Williams
2012-Jul-04 20:47 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Wed, Jul 4, 2012 at 11:14 AM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> On Tue, 3 Jul 2012, James Litchfield wrote:
>> Agreed - msync/munmap is the only guarantee.
>
> I don't see that the munmap definition assures that anything is written to
> "disk". The system is free to buffer the data in RAM as long as it likes
> without writing anything at all.

Oddly enough, the manpages at the Open Group don't make this clear. So I think it may well be advisable to use msync(3C) before munmap() on MAP_SHARED mappings. However, I think all implementors should, and probably all do (Linux even documents that it does), have an implied msync(2) when doing a munmap(2). It really makes no sense at all to have munmap(2) not imply msync(3C).

(That's another thing: I don't see where the standard requires that munmap(2) be synchronous. I think it'd be nice to have an mmap(2) option for requesting whether munmap(2) of the same mapping be synchronous or asynchronous. Async munmap(2) -> no need to mount cross-calls, instead allowing the mapping to be torn down over time. Doing a synchronous msync(3C), then a munmap(2) is a recipe for going real slow, but if munmap(2) does not portably guarantee an implied msync(3C), then would it be safe to do an async msync(2) then munmap(2)?)

Nico
--
John Martin
2012-Jul-04 21:23 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
On 07/04/12 16:47, Nico Williams wrote:
>> I don't see that the munmap definition assures that anything is written to
>> "disk". The system is free to buffer the data in RAM as long as it likes
>> without writing anything at all.
>
> Oddly enough, the manpages at the Open Group don't make this clear. So
> I think it may well be advisable to use msync(3C) before munmap() on
> MAP_SHARED mappings. However, I think all implementors should, and
> probably all do (Linux even documents that it does), have an implied
> msync(2) when doing a munmap(2). It really makes no sense at all to
> have munmap(2) not imply msync(3C).

This assumes msync() has the behavior you expect. See:

http://pubs.opengroup.org/onlinepubs/009695399/functions/msync.html

In particular, the paragraph starting with "For mappings to files, ...".
Stefan Ring
2012-Jul-04 21:32 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
> It really makes no sense at all to
> have munmap(2) not imply msync(3C).

Why not? munmap(2) does basically the equivalent of write(2). In the case of write, that is: a later read from the same location will see the written data, unless another write happens in-between. If power goes down following the write, all bets are off. And translated to munmap: a subsequent call to mmap(2) that makes the previously munmapped region available will make visible everything stored to the region prior to the munmap call. If power goes down following the munmap, all bets are off.

In both cases, if you want your data to persist across power losses, use sync -- fsync or msync. If only the syncing variants were available, disk accesses would be significantly slower, and disks would thrash rather audibly all the time.
Peter Jeremy
2012-Jul-04 23:22 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
On 2012-Jul-05 06:47:36 +1000, Nico Williams <nico at cryptonector.com> wrote:
> On Wed, Jul 4, 2012 at 11:14 AM, Bob Friesenhahn
> <bfriesen at simple.dallas.tx.us> wrote:
>> On Tue, 3 Jul 2012, James Litchfield wrote:
>>> Agreed - msync/munmap is the only guarantee.
>>
>> I don't see that the munmap definition assures that anything is written to
>> "disk". The system is free to buffer the data in RAM as long as it likes
>> without writing anything at all.
>
> Oddly enough, the manpages at the Open Group don't make this clear.

They don't specify the behaviour on write(2) or close(2) either. All this means is that there is no guarantee that munmap(2) (or write(2) or close(2)) will immediately flush the data to stable storage.

> So I think it may well be advisable to use msync(3C) before munmap() on
> MAP_SHARED mappings.

If you want to be certain that your changes will be flushed to stable storage by a particular point in your program execution then you must call msync(MS_SYNC) before munmap(2).

> However, I think all implementors should, and
> probably all do (Linux even documents that it does), have an implied
> msync(2) when doing a munmap(2).

There's nothing in the standard requiring this behaviour, and it would adversely impact performance in the general case, so I would expect that implementors _wouldn't_ force msync(2) on munmap(2). FreeBSD definitely doesn't. As for Linux, I keep finding cases where, if a standard doesn't mandate specific behaviour, Linux will implement (and document) different behaviour to the way other OSs behave in the same situation.

> It really makes no sense at all to
> have munmap(2) not imply msync(3C).

Actually, it makes no more sense for munmap(2) to imply msync(2) than it does for close(2) [which is functionally equivalent] to imply fsync(2) - i.e. none at all.

> (That's another thing: I don't see where the standard requires that
> munmap(2) be synchronous.

http://pubs.opengroup.org/onlinepubs/009695399/functions/munmap.html states "Further references to these pages shall result in the generation of a SIGSEGV signal to the process." It's difficult to see how to implement this behaviour unless munmap(2) is synchronous.

> Async munmap(2) -> no need to mount cross-calls, instead allowing the
> mapping to be torn down over time. Doing a synchronous msync(3C), then
> a munmap(2) is a recipe for going real slow, but if munmap(2) does not
> portably guarantee an implied msync(3C), then would it be safe to do
> an async msync(2) then munmap(2)?)

I don't understand what you are trying to achieve here. munmap(2) should be a relatively cheap operation, so there is very little to be gained by making it asynchronous. Can you please explain a scenario where munmap(2) would be slow (other than cases where implementors have deliberately and unnecessarily made it slow)? I agree that msync(MS_SYNC) is slow, but if you want a guarantee that your data is securely written to stable storage then you need to wait for that stable storage.

msync(MS_ASYNC) should have no impact on a later munmap(2), and it should always be safe to call msync(MS_ASYNC) before munmap(2) (in fact, it's a good idea to maximise portability).

--
Peter Jeremy
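[To make that last recommendation concrete, a minimal sketch - the function and parameter names are illustrative, not from any existing codebase - of portable teardown for a MAP_SHARED mapping:]

    #include <sys/mman.h>

    /*
     * Portable teardown of a MAP_SHARED mapping.  msync(MS_ASYNC)
     * merely schedules the dirty pages for write-out and returns
     * immediately -- it provides no durability -- but it is cheap and
     * always safe to issue before munmap(2).  Pass durable != 0 to
     * block (MS_SYNC) until the data reaches stable storage.
     */
    static int
    unmap_shared(void *addr, size_t len, int durable)
    {
            if (msync(addr, len, durable ? MS_SYNC : MS_ASYNC) == -1)
                    return (-1);
            return (munmap(addr, len));
    }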
Bob Friesenhahn
2012-Jul-05 13:43 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Wed, 4 Jul 2012, Nico Williams wrote:
> Oddly enough, the manpages at the Open Group don't make this clear. So
> I think it may well be advisable to use msync(3C) before munmap() on
> MAP_SHARED mappings. However, I think all implementors should, and
> probably all do (Linux even documents that it does), have an implied
> msync(2) when doing a munmap(2). It really makes no sense at all to
> have munmap(2) not imply msync(3C).

As long as the system has a way to track which dirty pages map to particular files (Solaris historically does), it should not be necessary to synchronize the mapping to the underlying store simply due to munmap. It may be more efficient not to do that. The same pages may be mapped and unmapped many times by applications. In fact, several applications may memory map the same file so that they access the same pages, and it seems wrong to flush to the underlying store simply because one of the applications no longer references the page.

Since mmap() on zfs breaks the traditional coherent memory/filesystem model that Solaris enjoyed prior to zfs, it may be that some rules should be different when zfs is involved because of its redundant use of memory (zfs ARC and VM page cache).

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn
2012-Jul-05 13:54 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Wed, 4 Jul 2012, Stefan Ring wrote:
>> It really makes no sense at all to
>> have munmap(2) not imply msync(3C).
>
> Why not? munmap(2) does basically the equivalent of write(2). In the
> case of write, that is: a later read from the same location will see
> the written data, unless another write happens in-between.

Actually, a write to memory for a memory mapped file is more similar to write(2). If two programs have the same file mapped, then the effect on the memory they share is instantaneous because it is the same physical memory. A mmapped file becomes shared memory as soon as it is mapped at least twice.

It is pretty common for a system of applications to implement shared memory via memory mapped files, with the mapped memory used for read/write. This is a precursor to POSIX's shm_open(3RT), which produces similar functionality without a known file in the filesystem.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
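[A minimal sketch of that idiom - the rendezvous file name is hypothetical and error handling is abbreviated: a parent and child share one int through MAP_SHARED mappings of the same file, with no read(2)/write(2) calls involved:]

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
            /* Hypothetical rendezvous file; both processes map the same pages. */
            int fd = open("/tmp/shared.dat", O_RDWR | O_CREAT, 0600);
            if (fd == -1 || ftruncate(fd, sizeof (int)) == -1)
                    return (1);

            int *counter = mmap(NULL, sizeof (int), PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (counter == MAP_FAILED)
                    return (1);

            *counter = 0;
            if (fork() == 0) {
                    *counter = 42;  /* child stores into the shared page */
                    _exit(0);
            }
            (void) wait(NULL);
            /* Parent sees the child's store -- same physical memory. */
            return (*counter == 42 ? 0 : 1);
    }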
Stefan Ring
2012-Jul-05 21:00 UTC
[zfs-discuss] Interaction between ZFS intent log and mmap'd files
> Actually, a write to memory for a memory mapped file is more similar to
> write(2). If two programs have the same file mapped, then the effect on
> the memory they share is instantaneous because it is the same physical
> memory. A mmapped file becomes shared memory as soon as it is mapped at
> least twice.

True, for some interpretation of "instantaneous". It does not establish a happens-before relationship, though, as store-munmap/mmap-load does.
Dan Vatca
2012-Jul-06 17:16 UTC
[zfs-discuss] Cannot reset ZFS reservation and refreservation on volume
When creating a new zfs volume, the calculated refreservation is greater than volsize to account for the number of copies and metadata:

root@test:~# zfs create -V 1G rpool/test
root@test:~# zfs get -Hp volsize,volblocksize,copies,refreservation rpool/test
rpool/test    volsize           1073741824    local
rpool/test    volblocksize      8192          -
rpool/test    copies            1             default
rpool/test    refreservation    1107820544    local

After I set refreservation to none, I am no longer able to reset refreservation back to the required value, since there is a check in libzfs that prevents it:

root@danstore2:/lib# zfs set refreservation=none rpool/test
root@danstore2:/lib# zfs get -Hp volsize,volblocksize,copies,refreservation rpool/test
rpool/test    volsize           1073741824    local
rpool/test    volblocksize      8192          -
rpool/test    copies            1             default
rpool/test    refreservation    0             local
root@danstore2:/lib# zfs set refreservation=1107820544 rpool/test
cannot set property for 'rpool/test': 'refreservation' is greater than current volume size

Is this intended behavior or a bug?

The same is true for reservation. Setting reservation on a volume is also limited to volsize, but reading the documentation (http://docs.oracle.com/cd/E19253-01/819-5461/gazvb/index.html) I understand reservation may be as large as the user wants it to be. I think this is so because:

1. "The quota and reservation properties are convenient for managing disk space consumed by datasets and their descendents"
2. "... descendents, such as snapshots and clones"

If I understand correctly, the reservation on a volume accounts for all space consumed by the volume, its metadata and copies, and its descendant snapshots and clones, so it does not make any sense to limit it to volsize.

Digging into the libzfs code, I found that zfs_valid_proplist (in libzfs_dataset.c) specifically checks and prevents setting reservation and refreservation to more than volsize. I think the check should be removed for ZFS_PROP_RESERVATION, and limited to zvol_volsize_to_reservation(volsize, nvl) for ZFS_PROP_REFRESERVATION (when type == ZFS_TYPE_VOLUME).

Dan Vatca

On 6 Jul 2012, at 0:00, Stefan Ring wrote:
>> Actually, a write to memory for a memory mapped file is more similar to
>> write(2). If two programs have the same file mapped, then the effect on
>> the memory they share is instantaneous because it is the same physical
>> memory. A mmapped file becomes shared memory as soon as it is mapped at
>> least twice.
>
> True, for some interpretation of "instantaneous". It does not
> establish a happens-before relationship, though, as
> store-munmap/mmap-load does.
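[To illustrate the proposed rule as code - a purely hypothetical, self-contained sketch: only the role of zfs_valid_proplist, the name zvol_volsize_to_reservation, and the ZFS_PROP_* properties come from the message above; the helper names and the overhead formula below are made up:]

    #include <stdint.h>

    typedef enum { PROP_RESERVATION, PROP_REFRESERVATION } vol_prop_t;

    /*
     * Made-up stand-in for libzfs's zvol_volsize_to_reservation():
     * worst-case space a volume can consume (data + copies + metadata).
     * The real function computes this from the property nvlist; the
     * overhead term here is invented for illustration.
     */
    static uint64_t
    volsize_to_reservation(uint64_t volsize, uint64_t copies)
    {
            return (volsize * copies + volsize / 64);
    }

    /*
     * The suggested validation: reservation is unlimited (it may also
     * cover snapshots and clones), while refreservation is capped at
     * the volume's worst-case consumption rather than at volsize.
     */
    static int
    valid_volume_reservation(vol_prop_t prop, uint64_t value,
        uint64_t volsize, uint64_t copies)
    {
            if (prop == PROP_RESERVATION)
                    return (1);
            return (value <= volsize_to_reservation(volsize, copies));
    }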