Ben Turner
2017-Sep-01 02:19 UTC
[Gluster-users] GFID attr is missing after adding large amounts of data
I re-added gluster-users to get some more eyes on this.

----- Original Message -----
> From: "Christoph Schäbel" <christoph.schaebel at dc-square.de>
> To: "Ben Turner" <bturner at redhat.com>
> Sent: Wednesday, August 30, 2017 8:18:31 AM
> Subject: Re: [Gluster-users] GFID attr is missing after adding large amounts of data
>
> Hello Ben,
>
> thank you for offering your help.
>
> Here are outputs from all the gluster commands I could think of.
> Note that we had to remove the terabytes of data to keep the system
> operational, because it is a live system.
>
> # gluster volume status
>
> Status of volume: gv0
> Gluster process                            TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 10.191.206.15:/mnt/brick1/gv0        49154     0          Y       2675
> Brick 10.191.198.15:/mnt/brick1/gv0        49154     0          Y       2679
> Self-heal Daemon on localhost              N/A       N/A        Y       12309
> Self-heal Daemon on 10.191.206.15          N/A       N/A        Y       2670
>
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks

OK, so your bricks are all online and you have two nodes with one brick per node.

> # gluster volume info
>
> Volume Name: gv0
> Type: Replicate
> Volume ID: 5e47d0b8-b348-45bb-9a2a-800f301df95b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: 10.191.206.15:/mnt/brick1/gv0
> Brick2: 10.191.198.15:/mnt/brick1/gv0
> Options Reconfigured:
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on

You are using a replicate volume with two copies of your data, and it looks like you are running with the defaults as I don't see any tuning.

> # gluster peer status
>
> Number of Peers: 1
>
> Hostname: 10.191.206.15
> Uuid: 030a879d-da93-4a48-8c69-1c552d3399d2
> State: Peer in Cluster (Connected)
>
> # gluster --version
>
> glusterfs 3.8.11 built on Apr 11 2017 09:50:39
> Repository revision: git://git.gluster.com/glusterfs.git
> Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
> GlusterFS comes with ABSOLUTELY NO WARRANTY.
> You may redistribute copies of GlusterFS under the terms of the GNU General
> Public License.

You are running Gluster 3.8, which is the latest upstream release marked stable.

> # df -h
>
> Filesystem               Size  Used  Avail  Use%  Mounted on
> /dev/mapper/vg00-root     75G  5.7G   69G     8%  /
> devtmpfs                 1.9G     0  1.9G     0%  /dev
> tmpfs                    1.9G     0  1.9G     0%  /dev/shm
> tmpfs                    1.9G   17M  1.9G     1%  /run
> tmpfs                    1.9G     0  1.9G     0%  /sys/fs/cgroup
> /dev/sda1                477M  151M  297M    34%  /boot
> /dev/mapper/vg10-brick1  8.0T  700M  8.0T     1%  /mnt/brick1
> localhost:/gv0           8.0T  768M  8.0T     1%  /mnt/glusterfs_client
> tmpfs                    380M     0  380M     0%  /run/user/0

Your brick is:

/dev/mapper/vg10-brick1  8.0T  700M  8.0T  1%  /mnt/brick1

The block device is 8TB. Can you tell me more about your brick? Is it a single disk or a RAID? If it's a RAID, can you tell me about the disks?
I am interested in:

- Size of disks
- RAID type
- Stripe size
- RAID controller

I also see:

localhost:/gv0  8.0T  768M  8.0T  1%  /mnt/glusterfs_client

So you are mounting your volume on the local node; is this the mount where you are writing data to?

> The setup of the servers is done via shell script on CentOS 7 containing the
> following commands:
>
> yum install -y centos-release-gluster
> yum install -y glusterfs-server
>
> mkdir /mnt/brick1
> ssm create -s 999G -n brick1 --fstype xfs -p vg10 /dev/sdb /mnt/brick1

I haven't used system-storage-manager before, do you know if it takes care of properly tuning your storage stack (if you have a RAID, that is)? If you don't have a RAID it's prolly not that big of a deal, but if you do have a RAID we should make sure everything is aware of your stripe size and tune appropriately.

> echo "/dev/mapper/vg10-brick1 /mnt/brick1 xfs defaults 1 2" >> /etc/fstab
> mount -a && mount
> mkdir /mnt/brick1/gv0
>
> gluster peer probe OTHER_SERVER_IP
>
> gluster pool list
> gluster volume create gv0 replica 2 OWN_SERVER_IP:/mnt/brick1/gv0 OTHER_SERVER_IP:/mnt/brick1/gv0
> gluster volume start gv0
> gluster volume info gv0
> gluster volume set gv0 network.ping-timeout "10"
> gluster volume info gv0
>
> # mount as client for archiving cronjob, is already in fstab
> mount -a
>
> # mount via fuse-client
> mkdir -p /mnt/glusterfs_client
> echo "localhost:/gv0 /mnt/glusterfs_client glusterfs defaults,_netdev 0 0" >> /etc/fstab
> mount -a
>
> We untar multiple files (around 1300 tar files), each around 2.7GB in size.
> The tar files are not compressed.
> We untar the files with a shell script containing the following:
>
> #! /bin/bash
> for f in *.tar; do tar xfP $f; done

Your script looks good. I am not that familiar with the tar flag "P", but it looks to mean:

-P, --absolute-names
       Don't strip leading slashes from file names when creating archives.

I don't see anything strange here, everything looks OK.

> The script is run as user root; the processes glusterd, glusterfs and
> glusterfsd also run under user root.
>
> Each tar file consists of a single folder with multiple folders and files in it.
> The folder tree looks like this (note that the "=" is part of the folder name):
>
> 1498780800/
> - timeframe_hour=1498780800/ (about 25 of these folders)
> -- type=1/ (about 25 folders total)
> --- data-x.gz.parquet (between 100MB and 1kb in size)
> --- data-x.gz.parquet.crc (around 1kb in size)
> -- ...
> - ...
>
> Unfortunately I cannot share the file contents with you.

That's no problem, I'll try to recreate this in the lab.

> We have not seen any other issues with glusterfs when untarring just a few of
> those files. I just tried writing 100GB with dd and did not see any issues
> there; the file is replicated and the GFID attribute is set correctly on
> both nodes.

ACK. I do this all the time; if you saw an issue here I would be worried about your setup.

> We are not able to reproduce this in our lab environment, which is a clone
> (actual cloned VMs) of the other system, but it only has around 1TB of storage.
> Do you think this could be an issue with the number of files that is
> generated by tar (over 1.5 million files)?
> What I can say is that it is not an issue with inodes; I checked that when
> all the files were unpacked on the live system.

Hmm, I am not sure.
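A couple of quick checks along those lines, sketched with the paths from the df output above (adjust them if your layout differs):

# Inode headroom on the brick filesystem, to rule out inode exhaustion
df -i /mnt/brick1

# Rough count of files that actually landed on the brick (excluding the
# .glusterfs metadata tree), for comparison with the ~1.5 million expected
find /mnt/brick1/gv0 -path '*/.glusterfs' -prune -o -type f -print | wc -l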
It's strange that you can't repro this on your other config; in the lab I have a ton of space to work with, so I can run a ton of data in my repro.

> If you need anything else, let me know.

Can you help clarify your reproducer so I can give it a go in the lab? From what I can tell you have:

1498780800/                                              <-- Just a string of numbers, this is the root dir of your tarball
- timeframe_hour=1498780800/ (about 25 of these folders) <-- This is the second level dir of your tarball, there are ~25 of these dirs that mention a timeframe and an hour
-- type=1/ (about 25 folders total)                      <-- This is the 3rd level of your tar, there are about 25 different type=$X dirs
--- data-x.gz.parquet (between 100MB and 1kb in size)    <-- This is your actual data. Is there just one pair of these files per dir, or multiple?
--- data-x.gz.parquet.crc (around 1kb in size)           <-- This is a checksum for the above file?

I have almost everything I need for my reproducer, can you answer the above questions about the data?

-b

> Thank you for your help,
> Christoph
>
> > Am 29.08.2017 um 06:36 schrieb Ben Turner <bturner at redhat.com>:
> >
> > Also include gluster v status, I want to check the status of your bricks
> > and SHD processes.
> >
> > -b
> >
> > ----- Original Message -----
> >> From: "Ben Turner" <bturner at redhat.com>
> >> To: "Christoph Schäbel" <christoph.schaebel at dc-square.de>
> >> Cc: gluster-users at gluster.org
> >> Sent: Tuesday, August 29, 2017 12:35:05 AM
> >> Subject: Re: [Gluster-users] GFID attr is missing after adding large
> >> amounts of data
> >>
> >> This is strange, a couple of questions:
> >>
> >> 1. What volume type is this? What tuning have you done? gluster v info
> >> output would be helpful here.
> >>
> >> 2. How big are your bricks?
> >>
> >> 3. Can you write me a quick reproducer so I can try this in the lab? Is it
> >> just a single multi-TB file you are untarring, or many? If you give me the
> >> steps to repro, and I hit it, we can get a bug open.
> >>
> >> 4. Other than this, are you seeing any other problems? What if you untar a
> >> smaller file(s)? Can you read and write to the volume with, say, dd without
> >> any problems?
> >>
> >> It sounds like you have some other issues affecting things here; there is no
> >> reason why you shouldn't be able to untar and write multiple TBs of data to
> >> gluster. Go ahead and answer those questions and I'll see what I can do to
> >> help you out.
> >>
> >> -b
> >>
> >> ----- Original Message -----
> >>> From: "Christoph Schäbel" <christoph.schaebel at dc-square.de>
> >>> To: gluster-users at gluster.org
> >>> Sent: Monday, August 28, 2017 3:55:31 AM
> >>> Subject: [Gluster-users] GFID attr is missing after adding large amounts
> >>> of data
> >>>
> >>> Hi Cluster Community,
> >>>
> >>> we are seeing some problems when adding multiple terabytes of data to a
> >>> 2-node replicated GlusterFS installation.
> >>>
> >>> The version is 3.8.11 on CentOS 7.
> >>> The machines are connected via 10Gbit LAN and are running 24/7. The OS is
> >>> virtualized on VMware.
> >>>
> >>> After a restart of node-1 we see that the log files are growing to
> >>> multiple gigabytes a day.
> >>>
> >>> Also, there seem to be problems with the replication.
> >>> The setup worked fine until sometime after we added the additional data
> >>> (around 3 TB in size) to node-1. We added the data to a mountpoint via
> >>> the client, not directly to the brick.
> >>> What we did is add tar files via a client-mount and then untar them while
> >>> in the client-mount folder.
> >>> The brick (/mnt/brick1/gv0) is using the XFS filesystem.
> >>>
> >>> When checking the file attributes of one of the files mentioned in the
> >>> brick logs, I can see that the gfid attribute is missing on node-1. On
> >>> node-2 the file does not even exist.
> >>>
> >>> getfattr -m . -d -e hex mnt/brick1/gv0/.glusterfs/40/59/40598e46-9868-4d7c-b494-7b978e67370a/type=type1/part-r-00002-4846e211-c81d-4c08-bb5e-f22fa5a4b404.gz.parquet
> >>>
> >>> # file: mnt/brick1/gv0/.glusterfs/40/59/40598e46-9868-4d7c-b494-7b978e67370a/type=type1/part-r-00002-4846e211-c81d-4c08-bb5e-f22fa5a4b404.gz.parquet
> >>> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> >>>
> >>> We repeated this scenario a second time with a fresh setup and got the
> >>> same results.
> >>>
> >>> Does anyone know what we are doing wrong?
> >>>
> >>> Is there maybe a problem with glusterfs and tar?
> >>>
> >>>
> >>> Log excerpts:
> >>>
> >>> glustershd.log
> >>>
> >>> [2017-07-26 15:31:36.290908] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on fe5c42ac-5fda-47d4-8221-484c8d826c06
> >>> [2017-07-26 15:31:36.294289] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.298287] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on e31ae2ca-a3d2-4a27-a6ce-9aae24608141
> >>> [2017-07-26 15:31:36.300695] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.303626] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 2cc9dafe-64d3-454a-a647-20deddfaebfe
> >>> [2017-07-26 15:31:36.305763] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.308639] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on cbabf9ed-41be-4d08-9cdb-5734557ddbea
> >>> [2017-07-26 15:31:36.310819] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.315057] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 8a3c1c16-8edf-40f0-b2ea-8e70c39e1a69
> >>> [2017-07-26 15:31:36.317196] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed.
> >>> Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>>
> >>>
> >>> bricks/mnt-brick1-gv0.log
> >>>
> >>> [2017-07-26 15:31:36.287831] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153546: LOOKUP <gfid:d99930df-6b47-4b55-9af3-c767afd6584c>/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (d99930df-6b47-4b55-9af3-c767afd6584c/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.294202] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/e7/2d/e72d9005-b958-432b-b4a9-37aaadd9d2df/type=type1/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.294235] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153564: LOOKUP <gfid:fe5c42ac-5fda-47d4-8221-484c8d826c06>/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (fe5c42ac-5fda-47d4-8221-484c8d826c06/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.300611] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/33/d4/33d47146-bc30-49dd-ada8-475bb75435bf/type=type2/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.300645] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153582: LOOKUP <gfid:e31ae2ca-a3d2-4a27-a6ce-9aae24608141>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (e31ae2ca-a3d2-4a27-a6ce-9aae24608141/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.305671] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/33/d4/33d47146-bc30-49dd-ada8-475bb75435bf/type=type1/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.305711] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153600: LOOKUP <gfid:2cc9dafe-64d3-454a-a647-20deddfaebfe>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (2cc9dafe-64d3-454a-a647-20deddfaebfe/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.310735] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/df/71/df715321-3078-47c8-bf23-dec47abe46d7/type=type2/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.310767] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153618: LOOKUP <gfid:cbabf9ed-41be-4d08-9cdb-5734557ddbea>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (cbabf9ed-41be-4d08-9cdb-5734557ddbea/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.317113] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for
> >>> /mnt/brick1/gv0/.glusterfs/df/71/df715321-3078-47c8-bf23-dec47abe46d7/type=type3/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.317146] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153636: LOOKUP <gfid:8a3c1c16-8edf-40f0-b2ea-8e70c39e1a69>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (8a3c1c16-8edf-40f0-b2ea-8e70c39e1a69/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>>
> >>>
> >>> Regards,
> >>> Christoph
> >>> _______________________________________________
> >>> Gluster-users mailing list
> >>> Gluster-users at gluster.org
> >>> http://lists.gluster.org/mailman/listinfo/gluster-users
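The brick log entries above are the useful signature here: posix_lookup reports buf->ia_gfid is null, which is what you see when the trusted.gfid xattr is absent on the brick entry, consistent with the getfattr output quoted earlier in this thread. Below is a minimal sketch of how one might scan a brick for such files, assuming the brick path /mnt/brick1/gv0 used throughout this thread; it is a read-only check, not a repair.

#!/bin/bash
# Sketch: list brick entries that have no trusted.gfid xattr set.
# The brick root below is an assumption taken from this thread.
BRICK=/mnt/brick1/gv0

find "$BRICK" -path "$BRICK/.glusterfs" -prune -o \( -type f -o -type d \) -print |
while read -r entry; do
    # getfattr exits non-zero (and prints nothing) when the xattr is absent
    if ! getfattr -n trusted.gfid -e hex --absolute-names "$entry" >/dev/null 2>&1; then
        echo "missing trusted.gfid: $entry"
    fi
done

Running this on both nodes would show whether the problem is limited to a few files or spread across the whole data set.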
Christoph Schäbel
2017-Sep-01 08:20 UTC
[Gluster-users] GFID attr is missing after adding large amounts of data
My answers inline.

> Am 01.09.2017 um 04:19 schrieb Ben Turner <bturner at redhat.com>:
>
> I re-added gluster-users to get some more eyes on this.
>
> ----- Original Message -----
>> From: "Christoph Schäbel" <christoph.schaebel at dc-square.de>
>> To: "Ben Turner" <bturner at redhat.com>
>> Sent: Wednesday, August 30, 2017 8:18:31 AM
>> Subject: Re: [Gluster-users] GFID attr is missing after adding large amounts of data
>>
>> Hello Ben,
>>
>> thank you for offering your help.
>>
>> Here are outputs from all the gluster commands I could think of.
>> Note that we had to remove the terabytes of data to keep the system
>> operational, because it is a live system.
>>
>> # gluster volume status
>>
>> Status of volume: gv0
>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick 10.191.206.15:/mnt/brick1/gv0        49154     0          Y       2675
>> Brick 10.191.198.15:/mnt/brick1/gv0        49154     0          Y       2679
>> Self-heal Daemon on localhost              N/A       N/A        Y       12309
>> Self-heal Daemon on 10.191.206.15          N/A       N/A        Y       2670
>>
>> Task Status of Volume gv0
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>
> OK, so your bricks are all online and you have two nodes with one brick per node.

Yes.

>> # gluster volume info
>>
>> Volume Name: gv0
>> Type: Replicate
>> Volume ID: 5e47d0b8-b348-45bb-9a2a-800f301df95b
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: 10.191.206.15:/mnt/brick1/gv0
>> Brick2: 10.191.198.15:/mnt/brick1/gv0
>> Options Reconfigured:
>> transport.address-family: inet
>> performance.readdir-ahead: on
>> nfs.disable: on
>
> You are using a replicate volume with two copies of your data, and it looks like you are running with the defaults as I don't see any tuning.

The only thing we tuned is network.ping-timeout; we set this to 10 seconds (if that is not the default anyway).

>> # gluster peer status
>>
>> Number of Peers: 1
>>
>> Hostname: 10.191.206.15
>> Uuid: 030a879d-da93-4a48-8c69-1c552d3399d2
>> State: Peer in Cluster (Connected)
>>
>> # gluster --version
>>
>> glusterfs 3.8.11 built on Apr 11 2017 09:50:39
>> Repository revision: git://git.gluster.com/glusterfs.git
>> Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> You may redistribute copies of GlusterFS under the terms of the GNU General
>> Public License.
>
> You are running Gluster 3.8, which is the latest upstream release marked stable.
>
>> # df -h
>>
>> Filesystem               Size  Used  Avail  Use%  Mounted on
>> /dev/mapper/vg00-root     75G  5.7G   69G     8%  /
>> devtmpfs                 1.9G     0  1.9G     0%  /dev
>> tmpfs                    1.9G     0  1.9G     0%  /dev/shm
>> tmpfs                    1.9G   17M  1.9G     1%  /run
>> tmpfs                    1.9G     0  1.9G     0%  /sys/fs/cgroup
>> /dev/sda1                477M  151M  297M    34%  /boot
>> /dev/mapper/vg10-brick1  8.0T  700M  8.0T     1%  /mnt/brick1
>> localhost:/gv0           8.0T  768M  8.0T     1%  /mnt/glusterfs_client
>> tmpfs                    380M     0  380M     0%  /run/user/0
>
> Your brick is:
>
> /dev/mapper/vg10-brick1  8.0T  700M  8.0T  1%  /mnt/brick1
>
> The block device is 8TB. Can you tell me more about your brick? Is it a single disk or a RAID? If it's a RAID, can you tell me about the disks? I am interested in:
>
> - Size of disks
> - RAID type
> - Stripe size
> - RAID controller

Not sure about the disks, because it comes from a large storage system (not the cheap NAS kind, but the really expensive rack kind), which is then used by VMware to present a single volume to my virtual machine.
I am pretty sure that on the storage system there is some kind of RAID going on, but I am not sure if that has an effect on the "virtual" disk that is presented to my VM. To the VM the disk does not look like a RAID, as far as I can tell.

# lvdisplay

  --- Logical volume ---
  LV Path                /dev/vg10/brick1
  LV Name                brick1
  VG Name                vg10
  LV UUID                OEvHEG-m5zc-2MQ1-3gNd-o2gh-q405-YWG02j
  LV Write Access        read/write
  LV Creation host, time localhost, 2017-01-26 09:44:08 +0000
  LV Status              available
  # open                 1
  LV Size                8.00 TiB
  Current LE             2096890
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:1

  --- Logical volume ---
  LV Path                /dev/vg00/root
  LV Name                root
  VG Name                vg00
  LV UUID                3uyF7l-Xhfa-6frx-qjsP-Iy0u-JdbQ-Me03AS
  LV Write Access        read/write
  LV Creation host, time localhost, 2016-12-15 14:24:08 +0000
  LV Status              available
  # open                 1
  LV Size                74.49 GiB
  Current LE             19069
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:0

# ssm list

  -----------------------------------------------------------
  Device     Free      Used      Total      Pool  Mount point
  -----------------------------------------------------------
  /dev/fd0                       4.00 KB
  /dev/sda                       80.00 GB         PARTITIONED
  /dev/sda1                      500.00 MB        /boot
  /dev/sda2  20.00 MB  74.49 GB  74.51 GB   vg00
  /dev/sda3                      5.00 GB          SWAP
  /dev/sdb   1.02 GB   8.00 TB   8.00 TB    vg10
  -----------------------------------------------------------
  -------------------------------------------------
  Pool  Type  Devices  Free      Used      Total
  -------------------------------------------------
  vg00  lvm   1        20.00 MB  74.49 GB  74.51 GB
  vg10  lvm   1        1.02 GB   8.00 TB   8.00 TB
  -------------------------------------------------
  ------------------------------------------------------------------------------------
  Volume            Pool  Volume size  FS    FS size    Free       Type    Mount point
  ------------------------------------------------------------------------------------
  /dev/vg00/root    vg00  74.49 GB     xfs   74.45 GB   69.36 GB   linear  /
  /dev/vg10/brick1  vg10  8.00 TB      xfs   8.00 TB    8.00 TB    linear  /mnt/brick1
  /dev/sda1               500.00 MB    ext4  500.00 MB  300.92 MB  part    /boot
  ------------------------------------------------------------------------------------

> I also see:
>
> localhost:/gv0  8.0T  768M  8.0T  1%  /mnt/glusterfs_client
>
> So you are mounting your volume on the local node; is this the mount where you are writing data to?

Yes, this is the mount I am writing to.

>> The setup of the servers is done via shell script on CentOS 7 containing the
>> following commands:
>>
>> yum install -y centos-release-gluster
>> yum install -y glusterfs-server
>>
>> mkdir /mnt/brick1
>> ssm create -s 999G -n brick1 --fstype xfs -p vg10 /dev/sdb /mnt/brick1
>
> I haven't used system-storage-manager before, do you know if it takes care of properly tuning your storage stack (if you have a RAID, that is)?
> If you don't have a RAID it's prolly not that big of a deal, but if you do have a RAID we should make sure everything is aware of your stripe size and tune appropriately.

I am not sure if ssm does any tuning by default, but since there does not seem to be a RAID (at least for the VM) I don't think tuning is necessary.

>> echo "/dev/mapper/vg10-brick1 /mnt/brick1 xfs defaults 1 2" >> /etc/fstab
>> mount -a && mount
>> mkdir /mnt/brick1/gv0
>>
>> gluster peer probe OTHER_SERVER_IP
>>
>> gluster pool list
>> gluster volume create gv0 replica 2 OWN_SERVER_IP:/mnt/brick1/gv0 OTHER_SERVER_IP:/mnt/brick1/gv0
>> gluster volume start gv0
>> gluster volume info gv0
>> gluster volume set gv0 network.ping-timeout "10"
>> gluster volume info gv0
>>
>> # mount as client for archiving cronjob, is already in fstab
>> mount -a
>>
>> # mount via fuse-client
>> mkdir -p /mnt/glusterfs_client
>> echo "localhost:/gv0 /mnt/glusterfs_client glusterfs defaults,_netdev 0 0" >> /etc/fstab
>> mount -a
>>
>> We untar multiple files (around 1300 tar files), each around 2.7GB in size.
>> The tar files are not compressed.
>> We untar the files with a shell script containing the following:
>>
>> #! /bin/bash
>> for f in *.tar; do tar xfP $f; done
>
> Your script looks good. I am not that familiar with the tar flag "P", but it looks to mean:
>
> -P, --absolute-names
>        Don't strip leading slashes from file names when creating archives.
>
> I don't see anything strange here, everything looks OK.
>
>> The script is run as user root; the processes glusterd, glusterfs and
>> glusterfsd also run under user root.
>>
>> Each tar file consists of a single folder with multiple folders and files in it.
>> The folder tree looks like this (note that the "=" is part of the folder name):
>>
>> 1498780800/
>> - timeframe_hour=1498780800/ (about 25 of these folders)
>> -- type=1/ (about 25 folders total)
>> --- data-x.gz.parquet (between 100MB and 1kb in size)
>> --- data-x.gz.parquet.crc (around 1kb in size)
>> -- ...
>> - ...
>>
>> Unfortunately I cannot share the file contents with you.
>
> That's no problem, I'll try to recreate this in the lab.
>
>> We have not seen any other issues with glusterfs when untarring just a few of
>> those files. I just tried writing 100GB with dd and did not see any issues
>> there; the file is replicated and the GFID attribute is set correctly on
>> both nodes.
>
> ACK. I do this all the time; if you saw an issue here I would be worried about your setup.
>
>> We are not able to reproduce this in our lab environment, which is a clone
>> (actual cloned VMs) of the other system, but it only has around 1TB of storage.
>> Do you think this could be an issue with the number of files that is
>> generated by tar (over 1.5 million files)?
>> What I can say is that it is not an issue with inodes; I checked that when
>> all the files were unpacked on the live system.
>
> Hmm, I am not sure. It's strange that you can't repro this on your other config; in the lab I have a ton of space to work with, so I can run a ton of data in my repro.
>
>> If you need anything else, let me know.
>
> Can you help clarify your reproducer so I can give it a go in the lab?
> From what I can tell you have:
>
> 1498780800/ <-- Just a string of numbers, this is the root dir of your tarball
> - timeframe_hour=1498780800/ (about 25 of these folders) <-- This is the second level dir of your tarball, there are ~25 of these dirs that mention a timeframe and an hour
> -- type=1/ (about 25 folders total) <-- This is the 3rd level of your tar, there are about 25 different type=$X dirs
> --- data-x.gz.parquet (between 100MB and 1kb in size) <-- This is your actual data. Is there just one pair of these files per dir, or multiple?
> --- data-x.gz.parquet.crc (around 1kb in size) <-- This is a checksum for the above file?
>
> I have almost everything I need for my reproducer, can you answer the above questions about the data?

Yes, this is all correct. There is just one pair in the last level, and the *.crc file is a checksum file.

Thank you for your help,
Christoph
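For anyone who wants to try the reproducer described in this thread, here is a minimal sketch of a generator for one tarball with that shape. The file names and sizes are illustrative assumptions (the real tarballs average around 2.7GB, with parquet files between 1kB and 100MB); only the directory layout is taken from the thread.

#!/bin/bash
# Sketch: build one tarball laid out like the data described above.
# Names and sizes are illustrative assumptions, not the real data set.
set -e

epoch=1498780800          # root dir is just an epoch-style number
root="$epoch"
mkdir -p "$root"

for h in $(seq 0 24); do                               # ~25 timeframe_hour= dirs
    hourdir="$root/timeframe_hour=$((epoch + h * 3600))"
    for t in $(seq 1 25); do                           # ~25 type= dirs per hour
        d="$hourdir/type=$t"
        mkdir -p "$d"
        # one data file plus a small .crc per dir; sizes capped at ~10MB here
        # so a single tarball stays near the ~2.7GB mentioned in the thread
        size_kb=$(shuf -i 1-10240 -n 1)
        dd if=/dev/urandom of="$d/data-x.gz.parquet" bs=1K count="$size_kb" status=none
        dd if=/dev/urandom of="$d/data-x.gz.parquet.crc" bs=1K count=1 status=none
    done
done

tar cf "${epoch}.tar" "$root"

Generating on the order of 1300 of these (bumping the epoch each time) and untarring them in a loop on the FUSE mount, as in the script earlier in the thread, should approximate the reported workload.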