thr3ads.net - Gluster users - [Gluster-users] Gluster 3.12.14: wrong quota in Distributed Dispersed Volume [Nov 2018]

Hari Gowtham

2018-Nov-26 11:56 UTC

[Gluster-users] Gluster 3.12.14: wrong quota in Distributed Dispersed Volume

Yes. In that case you can run the script to see what errors it reports, and
then clean that directory up by setting the dirty flag and doing a lookup.
Again, for such a huge volume it will consume a lot of resources.
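
For readers following along, the cleanup sequence described above (mark the
directory dirty, then trigger a lookup from a client mount) might be sketched
roughly as follows. This is only an illustration under stated assumptions: the
helper name, the example paths, the use of setfattr on the brick-side
directory, and the value 0x3100 are not given anywhere in this thread; only
the attribute name trusted.glusterfs.quota.dirty is. The function builds the
command lines and executes nothing.

```python
import shlex

# Attribute name as mentioned in this thread.
DIRTY_XATTR = "trusted.glusterfs.quota.dirty"


def plan_dirty_and_lookup(brick_dir, mount_dir, dirty_value="0x3100"):
    """Return shell command strings for "set dirty, then look up".

    brick_dir   -- the directory as seen on a brick (hypothetical path)
    mount_dir   -- the same directory as seen through a client mount
    dirty_value -- ASSUMPTION: a value often quoted for the dirty flag;
                   verify it for your Gluster version before using it
    """
    set_dirty = ["setfattr", "-n", DIRTY_XATTR, "-v", dirty_value, brick_dir]
    lookup = ["stat", mount_dir]  # any client-side access that triggers a lookup
    return [shlex.join(cmd) for cmd in (set_dirty, lookup)]


if __name__ == "__main__":
    # Example paths are placeholders, not taken from the thread.
    for line in plan_dirty_and_lookup("/bricks/b1/data/broken_dir",
                                      "/mnt/volume/broken_dir"):
        print(line)
```

Printing the commands instead of running them keeps the sketch safe to try on
any machine; whether the flag has to be set on every brick is not answered in
this thread.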

On Mon, Nov 26, 2018 at 3:56 PM Gudrun Mareike Amedick
<g.amedick at uni-luebeck.de> wrote:
> Hi,
>
> we have no notifications of OOM kills in /var/log/messages. So if I
> understood this correctly, the crawls finished but my attributes weren't set
> correctly? And this script should fix them?
>
> Thanks for your help so far
>
> Gudrun
> On Thursday, 22.11.2018, at 13:03 +0530, Hari Gowtham wrote:
> > On Wed, Nov 21, 2018 at 8:55 PM Gudrun Mareike Amedick
> > <g.amedick at uni-luebeck.de> wrote:
> > >
> > >
> > > Hi Hari,
> > >
> > > I disabled and re-enabled the quota and I saw the crawlers starting.
> > > However, this caused a pretty high load on my servers (200+) and this
> > > seems to have gotten them killed again. At least, I have no crawlers
> > > running, the quotas are not matching the output of du -h, and the
> > > crawler logs all contain this line:
> > The quota crawl is an intensive process, as it has to crawl the entire
> > file system. The intensity varies with the number of bricks, the
> > number of files, the depth of the filesystem, ongoing I/O to the
> > filesystem and so on.
> > Since this is a disperse volume, the crawl has to talk to all the bricks,
> > and given the huge size as well, the increase in CPU usage is expected.
> >
> > >
> > >
> > > [2018-11-20 14:16:35.180467] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7f0e3d6fe494] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xf5) [0x561eb7952d45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x561eb7952ba4] ) 0-: received signum (15), shutting down
> > This can mean either that the file attributes were set and the process
> > then stopped, or, as you said, that the process was killed while it
> > still had attributes to set on some files.
> >
> > This message is common to every shutdown (both the one triggered after
> > the job has finished and the one triggered to stop the process).
> > Can you check the /var/log/messages file for an "OOM" kill?
> > If you see those messages, then the shutdown was caused by the increase
> > in memory consumption, which is expected.
> >
> > >
> > >
> > > I suspect this means my file attributes are not set correctly. Would the
> > > script you sent me fix that? Also, the script seems to be part of the
> > > GlusterFS 5.0 Git repo. We are running 3.12. Would it still work on 3.12
> > > (or 4.1, since we'll be upgrading soon) or could it break things?
> > Quota is not actively developed because of its performance issues,
> > which need a major redesign. So the script holds true for newer
> > versions as well, because no changes have gone into the quota code.
> > The advantage of the script is that it can be run over just the faulty
> > directory (it need not be the root), which reduces the number of
> > directories and files and the depth to be traversed.
> > The crawl is necessary for the quota to work properly. The script can help
> > only if the xattrs have been set by the crawl, which I think isn't the
> > case here.
> > (To verify whether the xattrs are set on all the directories, we need to
> > do a getxattr and see.) So we can't use the script.
> >
> >
> > >
> > >
> > > Kind regards
> > >
> > > Gudrun Amedick
> > > On Tuesday, 20.11.2018, at 16:59 +0530, Hari Gowtham wrote:
> > > >
> > > > reply inline.
> > > > On Tue, Nov 20, 2018 at 3:53 PM Gudrun Mareike Amedick
> > > > <g.amedick at uni-luebeck.de> wrote:
> > > > >
> > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > I think I know what happened. According to the logs, the crawlers
> > > > > received a signum(15). They seem to have died before finishing,
> > > > > probably because there was too much to do simultaneously. I have
> > > > > disabled and re-enabled quota and will set the quotas again with
> > > > > more time.
> > > > >
> > > > > Is there a way to restart a crawler that was killed too soon?
> > > > No. Disabling and re-enabling quota starts a new crawl.
> > > >
> > > > >
> > > > >
> > > > >
> > > > > If I restart a server while a crawler is running, will the crawler be
> > > > > restarted, too? We'll need to do some hardware fixing on one of the
> > > > > servers soon and I need to know whether I have to check the crawlers
> > > > > first before shutting it down.
> > > > During the shutdown of the server the crawl will be killed. (The data
> > > > usage shown will reflect whatever had been crawled up to that point.)
> > > > The crawl won't be restarted when the server starts. Only quotad will
> > > > be restarted (which is not the same as the crawl).
> > > > For the crawl to happen you will have to restart quota.
> > > >
> > > > >
> > > > >
> > > > >
> > > > > Thanks for the pointers
> > > > >
> > > > > Gudrun Amedick
> > > > > On Tuesday, 20.11.2018, at 11:38 +0530, Hari Gowtham wrote:
> > > > > >
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Can you check whether the quota crawl has finished? Until it has
> > > > > > finished, the quota list will show incorrect values.
> > > > > > Looking at the under-accounting, it looks like the crawl has not yet
> > > > > > finished (it does take a lot of time, as it has to crawl the whole
> > > > > > filesystem).
> > > > > >
> > > > > > If the crawl has finished and the usage is still showing wrong values,
> > > > > > then there is probably an accounting issue.
> > > > > > The easy way to fix this is to restart quota (disable and re-enable
> > > > > > it). This will not cause any problems. The only downside is that the
> > > > > > limits won't be enforced while quota is disabled, or until it is
> > > > > > re-enabled and the crawl finishes.
> > > > > > Or you can try using the quota fsck script
> > > > > > https://review.gluster.org/#/c/glusterfs/+/19179/ to fix your
> > > > > > accounting issue.
> > > > > >
> > > > > > Regards,
> > > > > > Hari.
> > > > > > On Mon, Nov 19, 2018 at 10:05 PM Frank Ruehlemann
> > > > > > <f.ruehlemann at uni-luebeck.de> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > we're running a Distributed Dispersed volume with Gluster 3.12.14 on
> > > > > > > Debian 9.6 (Stretch).
> > > > > > >
> > > > > > > We migrated our data (>300TB) from a pure Distributed volume into this
> > > > > > > Dispersed volume with cp, followed by multiple rsyncs.
> > > > > > > After the migration was successful we enabled quotas again with "gluster
> > > > > > > volume quota $VOLUME enable", which finished successfully.
> > > > > > > And we set our required quotas with "gluster volume quota $VOLUME
> > > > > > > limit-usage $PATH $QUOTA", which finished without errors too.
> > > > > > >
> > > > > > > But our "gluster volume quota $VOLUME list" shows wrong values.
> > > > > > > For example:
> > > > > > > A directory with ~170TB of data shows only 40.8TB Used.
> > > > > > > When we sum up all directories with quotas we're way under the ~310TB
> > > > > > > that "df -h /$volume" shows.
> > > > > > > And "df -h /$volume/$directory" shows wrong values for nearly all
> > > > > > > directories.
> > > > > > >
> > > > > > > All 72 8TB-bricks and all quota daemons of the 6 servers are visible and
> > > > > > > online in "gluster volume status $VOLUME".
> > > > > > >
> > > > > > >
> > > > > > > In quotad.log I found multiple warnings like this:
> > > > > > > >
> > > > > > > > [2018-11-16 09:21:25.738901] W [dict.c:636:dict_unref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/features/quotad.so(+0x1d58) [0x7f6844be7d58] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/features/quotad.so(+0x2b92) [0x7f6844be8b92] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_unref+0xc0) [0x7f684b0db640] ) 0-dict: dict is NULL [Invalid argument]
> > > > > > > In some brick logs I found those:
> > > > > > > >
> > > > > > > > [2018-11-19 07:23:30.932327] I [MSGID: 120020] [quota.c:2198:quota_unlink_cbk] 0-$VOLUME-quota: quota context not set inode (gfid:f100f7a9-0779-4b4c-880f-c8b3b4bdc49d) [Invalid argument]
> > > > > > > and (replaced the volume name with "$VOLUME") those:
> > > > > > > >
> > > > > > > > The message "W [MSGID: 120003] [quota.c:821:quota_build_ancestry_cbk] 0-$VOLUME-quota: parent is NULL [Invalid argument]" repeated 13 times between [2018-11-19 15:28:54.089404] and [2018-11-19 15:30:12.792175]
> > > > > > > > [2018-11-19 15:31:34.559348] W [MSGID: 120003] [quota.c:821:quota_build_ancestry_cbk] 0-$VOLUME-quota: parent is NULL [Invalid argument]
> > > > > > > I already found that setting the flag "trusted.glusterfs.quota.dirty"
> > > > > > > might help, but I'm unsure about the consequences that will be
> > > > > > > triggered.
> > > > > > > And I'm unsure about the necessary version flag.
> > > > > > >
> > > > > > > Has anyone an idea how to fix this?
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > --
> > > > > > > Frank Rühlemann
> > > > > > >    IT-Systemtechnik
> > > > > > >
> > > > > > > UNIVERSITÄT ZU LÜBECK
> > > > > > >     IT-Service-Center
> > > > > > >
> > > > > > >     Ratzeburger Allee 160
> > > > > > >     23562 Lübeck
> > > > > > >     Tel +49 451 3101 2034
> > > > > > >     Fax +49 451 3101 2004
> > > > > > >     ruehlemann at itsc.uni-luebeck.de
> > > > > > >     www.itsc.uni-luebeck.de
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Gluster-users mailing list
> > > > > > > Gluster-users at gluster.org
> > > > > > > https://lists.gluster.org/mailman/listinfo/gluster-users
> > > >
> >
> >
> > --
> > Regards,
> > Hari Gowtham.


-- 
Regards,
Hari Gowtham.
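
The check discussed in this message (look in /var/log/messages for OOM kills,
and in the crawler logs for the signum (15) shutdown line) could be scripted
along these lines. This is a minimal sketch, not part of any Gluster tooling:
the function name and the OOM pattern are assumptions, since the exact wording
of kernel OOM messages varies between kernel versions. The shutdown pattern is
taken from the log line quoted in the thread.

```python
import re
import sys

# ASSUMPTION: kernel OOM messages usually contain one of these phrases.
OOM_RE = re.compile(r"oom-killer|out of memory", re.IGNORECASE)
# Shutdown message as it appears in the crawler log quoted in this thread.
SIGTERM_RE = re.compile(r"received signum \(15\), shutting down")


def classify_log_lines(lines):
    """Return (oom_hits, sigterm_hits) found in an iterable of log lines."""
    oom_hits, sigterm_hits = [], []
    for line in lines:
        text = line.rstrip("\n")
        if OOM_RE.search(text):
            oom_hits.append(text)
        elif SIGTERM_RE.search(text):
            sigterm_hits.append(text)
    return oom_hits, sigterm_hits


if __name__ == "__main__":
    # Pass the log files to scan as arguments, for example /var/log/messages
    # and the crawler log files.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            oom, term = classify_log_lines(handle)
        print(f"{path}: {len(oom)} OOM line(s), {len(term)} SIGTERM shutdown(s)")
```

A SIGTERM shutdown line with no OOM hit does not by itself show whether the
crawl completed, since the same message is logged for every shutdown.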

Gudrun Mareike Amedick

2018-Nov-26 13:55 UTC

Hi Hari,

I'm sorry to bother you again, but I have a few questions concerning the
script.

Do I understand correctly that I have to execute it once per brick on each
server?
It is a dispersed volume, so the file size on brick side and on client side can
differ. Is that a problem?

Is it a reasonable course of action if I first run "python quota_fsck.py
--subdir $broken_dir $brickpath" to see if it reports something and, if so,
run "python quota_fsck.py --subdir $broken_dir --fix-issues $mountpoint
$brickpath" to correct the issues?

I'd run "du -h $mountpoint/broken_dir" from the client side as a
lookup. Is that sufficient?

Will further action be required or should this be enough?
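
To make the proposed sequence concrete, here is a small sketch that only
assembles the three command lines exactly as they are written in this message
(report run, fixing run, client-side lookup). Whether this is the right way to
invoke the script is the open question here, so treat the argument order as
unconfirmed; the function name and the example paths are hypothetical.

```python
import posixpath


def fsck_commands(broken_dir, brickpath, mountpoint, script="quota_fsck.py"):
    """Build the command lines quoted in the message above; nothing is run."""
    # Report-only pass over one subdirectory of one brick.
    report = ["python", script, "--subdir", broken_dir, brickpath]
    # Same pass, but with --fix-issues and the client mount point.
    fix = ["python", script, "--subdir", broken_dir,
           "--fix-issues", mountpoint, brickpath]
    # Client-side lookup of the affected directory.
    lookup = ["du", "-h", posixpath.join(mountpoint, broken_dir.lstrip("/"))]
    return {"report": report, "fix": fix, "lookup": lookup}


if __name__ == "__main__":
    # Placeholder paths, not taken from the thread.
    for name, cmd in fsck_commands("broken_dir", "/bricks/b1", "/mnt/volume").items():
        print(name + ":", " ".join(cmd))
```

Keeping the report and fix invocations separate mirrors the two-step approach
asked about above: inspect the report first, and only then decide whether to
run the fixing pass.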

Kind regards

Gudrun
