Raghavendra Gowdappa
2018-Oct-02 02:10 UTC
[Gluster-users] Update of work on fixing POSIX compliance issues in Glusterfs
All,

There have been issues related to POSIX compliance, especially while running database workloads <https://bugzilla.redhat.com/show_bug.cgi?id=1512691>, on Glusterfs. Recently we've worked on fixing some of them. This mail is an update on that effort. The issues themselves can be classified into the following categories:

- Rename atomicity. When rename (src, dst) is done with dst already present, at no point in time should access to dst (open, stat, chmod etc.) fail. However, since the rename changes the association of dst-path from dst-inode to src-inode, inode-based operations like open and stat that have already resolved dst-path into dst-inode will no longer find dst-inode after the rename and hence fail. VFS provides a workaround for this: it redoes the path resolution, provided the operation fails with ESTALE. There were some issues associated with this:
  - Glusterfs in some codepaths returned ENOENT even when the operation was on an inode, so VFS didn't retry the resolution. Much of the discussion around this topic can be found in this mail thread <https://www.spinics.net/lists/gluster-devel/msg18981.html>. This has been fixed by various patches: <http://review.gluster.org/r/I2e752ca60dd8af1b989dd1d29c7b002ee58440b4>, <http://review.gluster.org/r/I8d07d2ebb5a0da6c3ea478317442cb42f1797a4b>, <http://review.gluster.org/r/Ia07e3cece404811703c8cfbac9b402ca5fe98c1e>.
  - VFS retries exactly once. So when the retry also fails with ESTALE, VFS gives up and syscalls like open fail. We've hit this class of issues in bugs like this one <https://bugzilla.redhat.com/show_bug.cgi?id=1543279>. The current understanding is that real-world workloads won't hit this race, so a single retry is enough. NFS relies on the same VFS mechanism, and NFS developers say they've not hit bugs of this kind in real workloads.
  - DHT acquires locks on the src and dst inodes in its rename codepaths. If a parallel rename overwrote dst-inode, this locking failed and the rename operation used to fail. The issue is tracked and fixed as part of this bug <https://bugzilla.redhat.com/show_bug.cgi?id=1543279>.
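The retry behaviour described above can be mirrored by applications as well. Below is a minimal user-space sketch of the idea, assuming a hypothetical path on a Gluster FUSE mount; it simply retries the open once when the first attempt fails with ESTALE, analogous to the single re-resolution the kernel VFS performs internally:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Sketch only: a concurrent rename(src, dst) can invalidate the
     * dst inode between path resolution and the actual open, in which
     * case the operation fails with ESTALE and a fresh resolution is
     * attempted. The path below is hypothetical.
     */
    static int open_retry_estale(const char *path, int flags)
    {
        int fd = open(path, flags);
        if (fd < 0 && errno == ESTALE)
            fd = open(path, flags); /* one re-resolution, like the VFS */
        return fd;
    }

    int main(void)
    {
        int fd = open_retry_estale("/mnt/gluster/dst-file", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        close(fd);
        return 0;
    }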
- Quorum imposition by afr in the open fop. afr imposes quorum on fd-based operations, but not on open. This means operations can fail on a valid fd due to lack of quorum. Not fixed yet; tracked in this bug <https://bugzilla.redhat.com/show_bug.cgi?id=1634664>.

- Operations on a valid fd failing after the file was deleted by rename/unlink (a sketch of this access pattern follows this list):
  - Fuse-bridge used to randomly pick fds in the fstat codepath, as earlier versions of the fuse API didn't provide a filehandle as an argument of the Getattr request. This resulted in fstat failures when the file was deleted through rename/unlink after it had been successfully opened. Fixed by this patch <http://review.gluster.org/r/I67eebbf5407ca725ed111fbda4181ead10d03f6d> and this patch <http://review.gluster.org/r/I88dd29b3607cd2594eee9d72a1637b5346c8d49c>.
  - performance/open-behind fakes an open call. Due to bugs in the rename/unlink codepaths, it could fail to open the file before the file was deleted by rename or unlink. Fixed by this patch <https://review.gluster.org/#/c/glusterfs/+/20428/>.

- Stale (meta)data cached by various performance xlators:
  - md-cache used to cache stale fstat. Fixed by this patch <http://review.gluster.org/r/Ia4bb9dd36494944e2d91e9e71a79b5a3974a8c77>.
  - write-behind did not provide correct stat in the rename cbk when writes on src were cached in write-behind. Fixed by this patch <http://review.gluster.org/r/Ic9f2adf8edd0b58ebaf661f3a8d0ca086bc63111>.
  - write-behind did not provide correct stat in the readdirp response. Fixed by this patch <http://review.gluster.org/r/I12d167bf450648baa64be1cbe1ca0fddf5379521>.
  - Ordering of operations done on different fds by write-behind. It considered operations on different fds as independent, so an fstat issued after a write completed, with the two operations on different fds, didn't fetch a stat reflecting the write. Fixed by this patch <http://review.gluster.org/r/Iee748cebb6d2a5b32f9328aff2b5b7cbf6c52c05>.
  - readdir-ahead used to provide stale stat. Fixed by this patch <http://review.gluster.org/Ia27ff49a61922e88c73a1547ad8aacc9968a69df>.
  - Most of the caching xlators rely on ctime/mtime of stat to find out whether the current (meta)data is newer or staler than the cached (meta)data. However, ctime/mtime provided by replica/afr is not always consistent, as it can pick the stat from any of its subvolumes. This can be solved once the ctime generator <https://github.com/gluster/glusterfs/issues/208> becomes production ready and is enabled by default. Note that the ctime generator xlator can also help in fixing issues with tar <https://bugzilla.redhat.com/show_bug.cgi?id=1179169>, ElasticSearch <https://bugzilla.redhat.com/show_bug.cgi?id=1379568> etc. that rely on correctness of ctime. Also, I still see a rare pgbench failure even after all the fixes to bz 1512691, due to unreliable ctime/mtime from underlying xlators.

- Though this issue <https://bugzilla.redhat.com/show_bug.cgi?id=1601166> is not really a consistency issue, it hindered the performance of read-ahead, as fstats flushed the read-ahead cache. Note that fstats also have an impact on write-behind when reads and writes are interleaved on a file, as fstats wait on cached writes in write-behind. A bug <https://bugzilla.redhat.com/show_bug.cgi?id=1563508> has been filed on the fuse kernel module for implementation of a noatime feature so that fstats are not issued during reads.

- AMQP needed flock -w to work. Tracked as part of this issue <https://github.com/gluster/glusterfs/issues/465>.
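As referenced in the fd-related items above, here is a minimal sketch of the access pattern those fixes target: the create-tmp / write / rename-into-place cycle used by workloads like pgbench and SAS. The file names are hypothetical, and the two halves normally run in different processes or threads; the point is that once a file has been opened, fstat() and reads/writes on that fd must keep working even after another file is renamed over its path.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* consumer opens the current generation of the file */
        int fd = open("data", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open data"); return 1; }

        /* producer prepares the next generation under a temporary name ... */
        int tmp = open("data.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (tmp < 0) { perror("open data.tmp"); return 1; }
        if (write(tmp, "next\n", 5) != 5) { perror("write"); return 1; }
        fsync(tmp);
        close(tmp);

        /* ... and atomically moves it into place, replacing the old inode */
        if (rename("data.tmp", "data") != 0) { perror("rename"); return 1; }

        /* the fd opened earlier still refers to the old inode; fstat and
           reads/writes on it must succeed, which is what the fuse-bridge
           and open-behind fixes ensure */
        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        printf("old inode still accessible, size=%lld\n", (long long)st.st_size);

        close(fd);
        unlink("data");
        return 0;
    }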
The issues listed above are either fixed or work is in progress to fix them. There are still more issues which have not been worked on yet, and we'll provide updates on them in future. Some of the prominent known issues (the list is not exhaustive) are:

- Missing dentries <https://bugzilla.redhat.com/show_bug.cgi?id=1563848> when performance.parallel-readdir is enabled. Note that it's a cache-coherence issue; the dentries and files are still intact on the backend.

- Evaluate and initiate discussion on how to propagate errors encountered during commit of cached writes to the application. A wider discussion (across different filesystems) on this topic can be found at https://lwn.net/Articles/752063/. Thanks to @csabahenk for pointing out this discussion.

- Sanitize the stack to return ESTALE when the inode is missing and ENOENT when the path is missing. For example, storage/posix sometimes returns ENOENT in scenarios where gfid handles are missing, even though the correct error is ESTALE. Failing to return ESTALE can throw off the retry logic in VFS. An open failing with ENOENT is wrong, as open is a gfid-based operation. An easy fix would be for fuse-bridge to convert ENOENT to ESTALE in _all_ inode-based fop responses. Currently that is done only in the open(dir) codepath; it has to be extended to the other codepaths too.

- Lookup and rename in DHT are not atomic. Rename is a compound operation in DHT which involves some hardlinking, and in the rename window both src and dst are visible as hardlinks to each other. If a lookup samples src or dst in this window, it'll perceive the file as having hardlinks.

- Stale dentries of src in the inode tables (of fuse, protocol/server) after a successful rename(src, dst). This can be caused by a lookup on src racing with the rename. The issue is not very different from the caching xlators needing a way of identifying which of two pieces of (meta)data is the latest. The ctime generator xlator can be used here too, to compare the ctime of the parent directory as recorded in the itable with the one in the lookup response, making sure only the latest dentry is linked into the inode table.
  - Note that stale dentries can cause corruption in applications like SAS and pgbench that rely on the pattern of creating a tmp file, writing to it and renaming it onto the file to be consumed by another thread. Since src resolves to the dst inode due to stale dentries having the same stat as dst, the dst file ends up corrupted, as writes of the next cycle end up on the file being consumed from the previous cycle. So this is an important issue to be fixed.

- There are a few bugs around SAS, e.g. issues with fcntl locking <https://bugzilla.redhat.com/show_bug.cgi?id=1630735>.
  - From my limited conversations with people who use/work on SAS, it seems to rely on fsync as a checkpoint after which the changes by one job should be visible to other jobs which could be running on different mounts on a different machine. This means an fsync on one mount should also update the caches of other mounts with the new data. This functionality is currently missing in Glusterfs.

regards,
Raghavendra