Ben Turner
2017-Sep-01 02:19 UTC
[Gluster-users] GFID attr is missing after adding large amounts of data
I re-added gluster-users to get some more eyes on this.

----- Original Message -----
> From: "Christoph Schäbel" <christoph.schaebel at dc-square.de>
> To: "Ben Turner" <bturner at redhat.com>
> Sent: Wednesday, August 30, 2017 8:18:31 AM
> Subject: Re: [Gluster-users] GFID attr is missing after adding large amounts of data
>
> Hello Ben,
>
> thank you for offering your help.
>
> Here are outputs from all the gluster commands I could think of.
> Note that we had to remove the terabytes of data to keep the system
> operational, because it is a live system.
>
> # gluster volume status
>
> Status of volume: gv0
> Gluster process                            TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 10.191.206.15:/mnt/brick1/gv0        49154     0          Y       2675
> Brick 10.191.198.15:/mnt/brick1/gv0        49154     0          Y       2679
> Self-heal Daemon on localhost              N/A       N/A        Y       12309
> Self-heal Daemon on 10.191.206.15          N/A       N/A        Y       2670
>
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks

OK, so your bricks are all online and you have two nodes with one brick per node.

> # gluster volume info
>
> Volume Name: gv0
> Type: Replicate
> Volume ID: 5e47d0b8-b348-45bb-9a2a-800f301df95b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: 10.191.206.15:/mnt/brick1/gv0
> Brick2: 10.191.198.15:/mnt/brick1/gv0
> Options Reconfigured:
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on

You are using a replicate volume with two copies of your data, and it looks like you are running with the defaults as I don't see any tuning.

> # gluster peer status
>
> Number of Peers: 1
>
> Hostname: 10.191.206.15
> Uuid: 030a879d-da93-4a48-8c69-1c552d3399d2
> State: Peer in Cluster (Connected)
>
> # gluster --version
>
> glusterfs 3.8.11 built on Apr 11 2017 09:50:39
> Repository revision: git://git.gluster.com/glusterfs.git
> Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
> GlusterFS comes with ABSOLUTELY NO WARRANTY.
> You may redistribute copies of GlusterFS under the terms of the GNU General
> Public License.

You are running Gluster 3.8, which is the latest upstream release marked stable.

> # df -h
>
> Filesystem               Size  Used  Avail  Use%  Mounted on
> /dev/mapper/vg00-root     75G  5.7G   69G     8%  /
> devtmpfs                 1.9G     0  1.9G     0%  /dev
> tmpfs                    1.9G     0  1.9G     0%  /dev/shm
> tmpfs                    1.9G   17M  1.9G     1%  /run
> tmpfs                    1.9G     0  1.9G     0%  /sys/fs/cgroup
> /dev/sda1                477M  151M  297M    34%  /boot
> /dev/mapper/vg10-brick1  8.0T  700M  8.0T     1%  /mnt/brick1
> localhost:/gv0           8.0T  768M  8.0T     1%  /mnt/glusterfs_client
> tmpfs                    380M     0  380M     0%  /run/user/0

Your brick is:

/dev/mapper/vg10-brick1  8.0T  700M  8.0T  1%  /mnt/brick1

The block device is 8TB. Can you tell me more about your brick? Is it a single disk or a RAID? If it's a RAID, can you tell me about the disks?
I am interested in:

- Size of disks
- RAID type
- Stripe size
- RAID controller

I also see:

localhost:/gv0  8.0T  768M  8.0T  1%  /mnt/glusterfs_client

So you are mounting your volume on the local node; is this the mount where you are writing data to?

> The setup of the servers is done via shell script on CentOS 7 containing the
> following commands:
>
> yum install -y centos-release-gluster
> yum install -y glusterfs-server
>
> mkdir /mnt/brick1
> ssm create -s 999G -n brick1 --fstype xfs -p vg10 /dev/sdb /mnt/brick1

I haven't used system-storage-manager before, do you know if it takes care of properly tuning your storage stack (if you have a RAID, that is)? If you don't have a RAID it's prolly not that big of a deal, but if you do have a RAID we should make sure everything is aware of your stripe size and tune appropriately.

> echo "/dev/mapper/vg10-brick1 /mnt/brick1 xfs defaults 1 2" >> /etc/fstab
> mount -a && mount
> mkdir /mnt/brick1/gv0
>
> gluster peer probe OTHER_SERVER_IP
>
> gluster pool list
> gluster volume create gv0 replica 2 OWN_SERVER_IP:/mnt/brick1/gv0 OTHER_SERVER_IP:/mnt/brick1/gv0
> gluster volume start gv0
> gluster volume info gv0
> gluster volume set gv0 network.ping-timeout "10"
> gluster volume info gv0
>
> # mount as client for archiving cronjob, is already in fstab
> mount -a
>
> # mount via fuse-client
> mkdir -p /mnt/glusterfs_client
> echo "localhost:/gv0 /mnt/glusterfs_client glusterfs defaults,_netdev 0 0" >> /etc/fstab
> mount -a
>
> We untar multiple files (around 1300 tar files), each around 2.7GB in size.
> The tar files are not compressed.
> We untar the files with a shell script containing the following:
>
> #! /bin/bash
> for f in *.tar; do tar xfP $f; done

Your script looks good. I am not that familiar with the tar flag "P", but it looks to mean:

-P, --absolute-names
       Don't strip leading slashes from file names when creating archives.

I don't see anything strange here, everything looks OK.

> The script is run as user root; the processes glusterd, glusterfs and
> glusterfsd also run under user root.
>
> Each tar file consists of a single folder with multiple folders and files in it.
> The folder tree looks like this (note that the "=" is part of the folder name):
>
> 1498780800/
> - timeframe_hour=1498780800/ (about 25 of these folders)
> -- type=1/ (about 25 folders total)
> --- data-x.gz.parquet (between 100MB and 1kb in size)
> --- data-x.gz.parquet.crc (around 1kb in size)
> -- ...
> - ...
>
> Unfortunately I cannot share the file contents with you.

That's no problem, I'll try to recreate this in the lab.

> We have not seen any other issues with glusterfs when untarring just a few of
> those files. I just tried writing 100GB with dd and did not see any issues
> there; the file is replicated and the GFID attribute is set correctly on
> both nodes.

ACK. I do this all the time; if you saw an issue here I would be worried about your setup.

> We are not able to reproduce this in our lab environment, which is a clone
> (actual cloned VMs) of the other system, but it only has around 1TB of storage.
> Do you think this could be an issue with the number of files that is
> generated by tar (over 1.5 million files)?
> What I can say is that it is not an issue with inodes; I checked that when
> all the files were unpacked on the live system.

Hmm, I am not sure.
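A couple of quick checks along those lines, sketched with the paths from the df output above (adjust them if your layout differs):

# Inode headroom on the brick filesystem, to rule out inode exhaustion
df -i /mnt/brick1

# Rough count of files that actually landed on the brick (excluding the
# .glusterfs metadata tree), for comparison with the ~1.5 million expected
find /mnt/brick1/gv0 -path '*/.glusterfs' -prune -o -type f -print | wc -l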
It's strange that you can't repro this on your other config; in the lab I have a ton of space to work with, so I can run a ton of data in my repro.

> If you need anything else, let me know.

Can you help clarify your reproducer so I can give it a go in the lab? From what I can tell you have:

1498780800/                                              <-- Just a string of numbers, this is the root dir of your tarball
- timeframe_hour=1498780800/ (about 25 of these folders) <-- This is the second level dir of your tarball, there are ~25 of these dirs that mention a timeframe and an hour
-- type=1/ (about 25 folders total)                      <-- This is the 3rd level of your tar, there are about 25 different type=$X dirs
--- data-x.gz.parquet (between 100MB and 1kb in size)    <-- This is your actual data. Is there just one pair of these files per dir, or multiple?
--- data-x.gz.parquet.crc (around 1kb in size)           <-- This is a checksum for the above file?

I have almost everything I need for my reproducer, can you answer the above questions about the data?

-b

> Thank you for your help,
> Christoph
>
> > Am 29.08.2017 um 06:36 schrieb Ben Turner <bturner at redhat.com>:
> >
> > Also include gluster v status, I want to check the status of your bricks
> > and SHD processes.
> >
> > -b
> >
> > ----- Original Message -----
> >> From: "Ben Turner" <bturner at redhat.com>
> >> To: "Christoph Schäbel" <christoph.schaebel at dc-square.de>
> >> Cc: gluster-users at gluster.org
> >> Sent: Tuesday, August 29, 2017 12:35:05 AM
> >> Subject: Re: [Gluster-users] GFID attr is missing after adding large
> >> amounts of data
> >>
> >> This is strange, a couple of questions:
> >>
> >> 1. What volume type is this? What tuning have you done? gluster v info
> >> output would be helpful here.
> >>
> >> 2. How big are your bricks?
> >>
> >> 3. Can you write me a quick reproducer so I can try this in the lab? Is it
> >> just a single multi-TB file you are untarring, or many? If you give me the
> >> steps to repro, and I hit it, we can get a bug open.
> >>
> >> 4. Other than this, are you seeing any other problems? What if you untar a
> >> smaller file(s)? Can you read and write to the volume with, say, dd without
> >> any problems?
> >>
> >> It sounds like you have some other issues affecting things here; there is no
> >> reason why you shouldn't be able to untar and write multiple TBs of data to
> >> gluster. Go ahead and answer those questions and I'll see what I can do to
> >> help you out.
> >>
> >> -b
> >>
> >> ----- Original Message -----
> >>> From: "Christoph Schäbel" <christoph.schaebel at dc-square.de>
> >>> To: gluster-users at gluster.org
> >>> Sent: Monday, August 28, 2017 3:55:31 AM
> >>> Subject: [Gluster-users] GFID attr is missing after adding large amounts
> >>> of data
> >>>
> >>> Hi Cluster Community,
> >>>
> >>> we are seeing some problems when adding multiple terabytes of data to a
> >>> 2-node replicated GlusterFS installation.
> >>>
> >>> The version is 3.8.11 on CentOS 7.
> >>> The machines are connected via 10Gbit LAN and are running 24/7. The OS is
> >>> virtualized on VMware.
> >>>
> >>> After a restart of node-1 we see that the log files are growing to
> >>> multiple gigabytes a day.
> >>>
> >>> Also, there seem to be problems with the replication.
> >>> The setup worked fine until sometime after we added the additional data
> >>> (around 3 TB in size) to node-1. We added the data to a mountpoint via
> >>> the client, not directly to the brick.
> >>> What we did is add tar files via a client-mount and then untar them while
> >>> in the client-mount folder.
> >>> The brick (/mnt/brick1/gv0) is using the XFS filesystem.
> >>>
> >>> When checking the file attributes of one of the files mentioned in the
> >>> brick logs, I can see that the gfid attribute is missing on node-1. On
> >>> node-2 the file does not even exist.
> >>>
> >>> getfattr -m . -d -e hex mnt/brick1/gv0/.glusterfs/40/59/40598e46-9868-4d7c-b494-7b978e67370a/type=type1/part-r-00002-4846e211-c81d-4c08-bb5e-f22fa5a4b404.gz.parquet
> >>>
> >>> # file: mnt/brick1/gv0/.glusterfs/40/59/40598e46-9868-4d7c-b494-7b978e67370a/type=type1/part-r-00002-4846e211-c81d-4c08-bb5e-f22fa5a4b404.gz.parquet
> >>> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> >>>
> >>> We repeated this scenario a second time with a fresh setup and got the
> >>> same results.
> >>>
> >>> Does anyone know what we are doing wrong?
> >>>
> >>> Is there maybe a problem with glusterfs and tar?
> >>>
> >>>
> >>> Log excerpts:
> >>>
> >>> glustershd.log
> >>>
> >>> [2017-07-26 15:31:36.290908] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on fe5c42ac-5fda-47d4-8221-484c8d826c06
> >>> [2017-07-26 15:31:36.294289] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.298287] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on e31ae2ca-a3d2-4a27-a6ce-9aae24608141
> >>> [2017-07-26 15:31:36.300695] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.303626] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 2cc9dafe-64d3-454a-a647-20deddfaebfe
> >>> [2017-07-26 15:31:36.305763] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.308639] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on cbabf9ed-41be-4d08-9cdb-5734557ddbea
> >>> [2017-07-26 15:31:36.310819] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.315057] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 8a3c1c16-8edf-40f0-b2ea-8e70c39e1a69
> >>> [2017-07-26 15:31:36.317196] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed.
> >>> Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>>
> >>>
> >>> bricks/mnt-brick1-gv0.log
> >>>
> >>> [2017-07-26 15:31:36.287831] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153546: LOOKUP <gfid:d99930df-6b47-4b55-9af3-c767afd6584c>/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (d99930df-6b47-4b55-9af3-c767afd6584c/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.294202] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/e7/2d/e72d9005-b958-432b-b4a9-37aaadd9d2df/type=type1/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.294235] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153564: LOOKUP <gfid:fe5c42ac-5fda-47d4-8221-484c8d826c06>/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (fe5c42ac-5fda-47d4-8221-484c8d826c06/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.300611] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/33/d4/33d47146-bc30-49dd-ada8-475bb75435bf/type=type2/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.300645] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153582: LOOKUP <gfid:e31ae2ca-a3d2-4a27-a6ce-9aae24608141>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (e31ae2ca-a3d2-4a27-a6ce-9aae24608141/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.305671] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/33/d4/33d47146-bc30-49dd-ada8-475bb75435bf/type=type1/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.305711] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153600: LOOKUP <gfid:2cc9dafe-64d3-454a-a647-20deddfaebfe>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (2cc9dafe-64d3-454a-a647-20deddfaebfe/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.310735] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/df/71/df715321-3078-47c8-bf23-dec47abe46d7/type=type2/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.310767] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153618: LOOKUP <gfid:cbabf9ed-41be-4d08-9cdb-5734557ddbea>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (cbabf9ed-41be-4d08-9cdb-5734557ddbea/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.317113] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for
> >>> /mnt/brick1/gv0/.glusterfs/df/71/df715321-3078-47c8-bf23-dec47abe46d7/type=type3/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.317146] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153636: LOOKUP <gfid:8a3c1c16-8edf-40f0-b2ea-8e70c39e1a69>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (8a3c1c16-8edf-40f0-b2ea-8e70c39e1a69/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>>
> >>>
> >>> Regards,
> >>> Christoph
> >>> _______________________________________________
> >>> Gluster-users mailing list
> >>> Gluster-users at gluster.org
> >>> http://lists.gluster.org/mailman/listinfo/gluster-users
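The brick log entries above are the useful signature here: posix_lookup reports buf->ia_gfid is null, which is what you see when the trusted.gfid xattr is absent on the brick entry, consistent with the getfattr output quoted earlier in this thread. Below is a minimal sketch of how one might scan a brick for such files, assuming the brick path /mnt/brick1/gv0 used throughout this thread; it is a read-only check, not a repair.

#!/bin/bash
# Sketch: list brick entries that have no trusted.gfid xattr set.
# The brick root below is an assumption taken from this thread.
BRICK=/mnt/brick1/gv0

find "$BRICK" -path "$BRICK/.glusterfs" -prune -o \( -type f -o -type d \) -print |
while read -r entry; do
    # getfattr exits non-zero (and prints nothing) when the xattr is absent
    if ! getfattr -n trusted.gfid -e hex --absolute-names "$entry" >/dev/null 2>&1; then
        echo "missing trusted.gfid: $entry"
    fi
done

Running this on both nodes would show whether the problem is limited to a few files or spread across the whole data set.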
Christoph Schäbel
2017-Sep-01 08:20 UTC
[Gluster-users] GFID attr is missing after adding large amounts of data
My answers inline.

> Am 01.09.2017 um 04:19 schrieb Ben Turner <bturner at redhat.com>:
>
> I re-added gluster-users to get some more eyes on this.
>
> ----- Original Message -----
>> From: "Christoph Schäbel" <christoph.schaebel at dc-square.de>
>> To: "Ben Turner" <bturner at redhat.com>
>> Sent: Wednesday, August 30, 2017 8:18:31 AM
>> Subject: Re: [Gluster-users] GFID attr is missing after adding large amounts of data
>>
>> Hello Ben,
>>
>> thank you for offering your help.
>>
>> Here are outputs from all the gluster commands I could think of.
>> Note that we had to remove the terabytes of data to keep the system
>> operational, because it is a live system.
>>
>> # gluster volume status
>>
>> Status of volume: gv0
>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick 10.191.206.15:/mnt/brick1/gv0        49154     0          Y       2675
>> Brick 10.191.198.15:/mnt/brick1/gv0        49154     0          Y       2679
>> Self-heal Daemon on localhost              N/A       N/A        Y       12309
>> Self-heal Daemon on 10.191.206.15          N/A       N/A        Y       2670
>>
>> Task Status of Volume gv0
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>
> OK, so your bricks are all online and you have two nodes with one brick per node.

Yes.

>> # gluster volume info
>>
>> Volume Name: gv0
>> Type: Replicate
>> Volume ID: 5e47d0b8-b348-45bb-9a2a-800f301df95b
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: 10.191.206.15:/mnt/brick1/gv0
>> Brick2: 10.191.198.15:/mnt/brick1/gv0
>> Options Reconfigured:
>> transport.address-family: inet
>> performance.readdir-ahead: on
>> nfs.disable: on
>
> You are using a replicate volume with two copies of your data, and it looks like you are running with the defaults as I don't see any tuning.

The only thing we tuned is network.ping-timeout; we set this to 10 seconds (if that is not the default anyway).

>> # gluster peer status
>>
>> Number of Peers: 1
>>
>> Hostname: 10.191.206.15
>> Uuid: 030a879d-da93-4a48-8c69-1c552d3399d2
>> State: Peer in Cluster (Connected)
>>
>> # gluster --version
>>
>> glusterfs 3.8.11 built on Apr 11 2017 09:50:39
>> Repository revision: git://git.gluster.com/glusterfs.git
>> Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> You may redistribute copies of GlusterFS under the terms of the GNU General
>> Public License.
>
> You are running Gluster 3.8, which is the latest upstream release marked stable.
>
>> # df -h
>>
>> Filesystem               Size  Used  Avail  Use%  Mounted on
>> /dev/mapper/vg00-root     75G  5.7G   69G     8%  /
>> devtmpfs                 1.9G     0  1.9G     0%  /dev
>> tmpfs                    1.9G     0  1.9G     0%  /dev/shm
>> tmpfs                    1.9G   17M  1.9G     1%  /run
>> tmpfs                    1.9G     0  1.9G     0%  /sys/fs/cgroup
>> /dev/sda1                477M  151M  297M    34%  /boot
>> /dev/mapper/vg10-brick1  8.0T  700M  8.0T     1%  /mnt/brick1
>> localhost:/gv0           8.0T  768M  8.0T     1%  /mnt/glusterfs_client
>> tmpfs                    380M     0  380M     0%  /run/user/0
>
> Your brick is:
>
> /dev/mapper/vg10-brick1  8.0T  700M  8.0T  1%  /mnt/brick1
>
> The block device is 8TB. Can you tell me more about your brick? Is it a single disk or a RAID? If it's a RAID, can you tell me about the disks? I am interested in:
>
> - Size of disks
> - RAID type
> - Stripe size
> - RAID controller

Not sure about the disks, because it comes from a large storage system (not the cheap NAS kind, but the really expensive rack kind), which is then used by VMware to present a single volume to my virtual machine.
I am pretty sure that on the storage system there is some kind of RAID going on, but I am not sure if that has an effect on the "virtual" disk that is presented to my VM. To the VM the disk does not look like a RAID, as far as I can tell.

# lvdisplay

  --- Logical volume ---
  LV Path                /dev/vg10/brick1
  LV Name                brick1
  VG Name                vg10
  LV UUID                OEvHEG-m5zc-2MQ1-3gNd-o2gh-q405-YWG02j
  LV Write Access        read/write
  LV Creation host, time localhost, 2017-01-26 09:44:08 +0000
  LV Status              available
  # open                 1
  LV Size                8.00 TiB
  Current LE             2096890
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:1

  --- Logical volume ---
  LV Path                /dev/vg00/root
  LV Name                root
  VG Name                vg00
  LV UUID                3uyF7l-Xhfa-6frx-qjsP-Iy0u-JdbQ-Me03AS
  LV Write Access        read/write
  LV Creation host, time localhost, 2016-12-15 14:24:08 +0000
  LV Status              available
  # open                 1
  LV Size                74.49 GiB
  Current LE             19069
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:0

# ssm list

  -----------------------------------------------------------
  Device     Free      Used      Total      Pool  Mount point
  -----------------------------------------------------------
  /dev/fd0                       4.00 KB
  /dev/sda                       80.00 GB         PARTITIONED
  /dev/sda1                      500.00 MB        /boot
  /dev/sda2  20.00 MB  74.49 GB  74.51 GB   vg00
  /dev/sda3                      5.00 GB          SWAP
  /dev/sdb   1.02 GB   8.00 TB   8.00 TB    vg10
  -----------------------------------------------------------
  -------------------------------------------------
  Pool  Type  Devices  Free      Used      Total
  -------------------------------------------------
  vg00  lvm   1        20.00 MB  74.49 GB  74.51 GB
  vg10  lvm   1        1.02 GB   8.00 TB   8.00 TB
  -------------------------------------------------
  ------------------------------------------------------------------------------------
  Volume            Pool  Volume size  FS    FS size    Free       Type    Mount point
  ------------------------------------------------------------------------------------
  /dev/vg00/root    vg00  74.49 GB     xfs   74.45 GB   69.36 GB   linear  /
  /dev/vg10/brick1  vg10  8.00 TB      xfs   8.00 TB    8.00 TB    linear  /mnt/brick1
  /dev/sda1               500.00 MB    ext4  500.00 MB  300.92 MB  part    /boot
  ------------------------------------------------------------------------------------

> I also see:
>
> localhost:/gv0  8.0T  768M  8.0T  1%  /mnt/glusterfs_client
>
> So you are mounting your volume on the local node; is this the mount where you are writing data to?

Yes, this is the mount I am writing to.

>> The setup of the servers is done via shell script on CentOS 7 containing the
>> following commands:
>>
>> yum install -y centos-release-gluster
>> yum install -y glusterfs-server
>>
>> mkdir /mnt/brick1
>> ssm create -s 999G -n brick1 --fstype xfs -p vg10 /dev/sdb /mnt/brick1
>
> I haven't used system-storage-manager before, do you know if it takes care of properly tuning your storage stack (if you have a RAID, that is)?
> If you don't have a RAID it's prolly not that big of a deal, but if you do have a RAID we should make sure everything is aware of your stripe size and tune appropriately.

I am not sure if ssm does any tuning by default, but since there does not seem to be a RAID (at least for the VM) I don't think tuning is necessary.

>> echo "/dev/mapper/vg10-brick1 /mnt/brick1 xfs defaults 1 2" >> /etc/fstab
>> mount -a && mount
>> mkdir /mnt/brick1/gv0
>>
>> gluster peer probe OTHER_SERVER_IP
>>
>> gluster pool list
>> gluster volume create gv0 replica 2 OWN_SERVER_IP:/mnt/brick1/gv0 OTHER_SERVER_IP:/mnt/brick1/gv0
>> gluster volume start gv0
>> gluster volume info gv0
>> gluster volume set gv0 network.ping-timeout "10"
>> gluster volume info gv0
>>
>> # mount as client for archiving cronjob, is already in fstab
>> mount -a
>>
>> # mount via fuse-client
>> mkdir -p /mnt/glusterfs_client
>> echo "localhost:/gv0 /mnt/glusterfs_client glusterfs defaults,_netdev 0 0" >> /etc/fstab
>> mount -a
>>
>> We untar multiple files (around 1300 tar files), each around 2.7GB in size.
>> The tar files are not compressed.
>> We untar the files with a shell script containing the following:
>>
>> #! /bin/bash
>> for f in *.tar; do tar xfP $f; done
>
> Your script looks good. I am not that familiar with the tar flag "P", but it looks to mean:
>
> -P, --absolute-names
>        Don't strip leading slashes from file names when creating archives.
>
> I don't see anything strange here, everything looks OK.
>
>> The script is run as user root; the processes glusterd, glusterfs and
>> glusterfsd also run under user root.
>>
>> Each tar file consists of a single folder with multiple folders and files in it.
>> The folder tree looks like this (note that the "=" is part of the folder name):
>>
>> 1498780800/
>> - timeframe_hour=1498780800/ (about 25 of these folders)
>> -- type=1/ (about 25 folders total)
>> --- data-x.gz.parquet (between 100MB and 1kb in size)
>> --- data-x.gz.parquet.crc (around 1kb in size)
>> -- ...
>> - ...
>>
>> Unfortunately I cannot share the file contents with you.
>
> That's no problem, I'll try to recreate this in the lab.
>
>> We have not seen any other issues with glusterfs when untarring just a few of
>> those files. I just tried writing 100GB with dd and did not see any issues
>> there; the file is replicated and the GFID attribute is set correctly on
>> both nodes.
>
> ACK. I do this all the time; if you saw an issue here I would be worried about your setup.
>
>> We are not able to reproduce this in our lab environment, which is a clone
>> (actual cloned VMs) of the other system, but it only has around 1TB of storage.
>> Do you think this could be an issue with the number of files that is
>> generated by tar (over 1.5 million files)?
>> What I can say is that it is not an issue with inodes; I checked that when
>> all the files were unpacked on the live system.
>
> Hmm, I am not sure. It's strange that you can't repro this on your other config; in the lab I have a ton of space to work with, so I can run a ton of data in my repro.
>
>> If you need anything else, let me know.
>
> Can you help clarify your reproducer so I can give it a go in the lab?
> From what I can tell you have:
>
> 1498780800/ <-- Just a string of numbers, this is the root dir of your tarball
> - timeframe_hour=1498780800/ (about 25 of these folders) <-- This is the second level dir of your tarball, there are ~25 of these dirs that mention a timeframe and an hour
> -- type=1/ (about 25 folders total) <-- This is the 3rd level of your tar, there are about 25 different type=$X dirs
> --- data-x.gz.parquet (between 100MB and 1kb in size) <-- This is your actual data. Is there just one pair of these files per dir, or multiple?
> --- data-x.gz.parquet.crc (around 1kb in size) <-- This is a checksum for the above file?
>
> I have almost everything I need for my reproducer, can you answer the above questions about the data?

Yes, this is all correct. There is just one pair in the last level, and the *.crc file is a checksum file.

Thank you for your help,
Christoph
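For anyone who wants to try the reproducer described in this thread, here is a minimal sketch of a generator for one tarball with that shape. The file names and sizes are illustrative assumptions (the real tarballs average around 2.7GB, with parquet files between 1kB and 100MB); only the directory layout is taken from the thread.

#!/bin/bash
# Sketch: build one tarball laid out like the data described above.
# Names and sizes are illustrative assumptions, not the real data set.
set -e

epoch=1498780800          # root dir is just an epoch-style number
root="$epoch"
mkdir -p "$root"

for h in $(seq 0 24); do                               # ~25 timeframe_hour= dirs
    hourdir="$root/timeframe_hour=$((epoch + h * 3600))"
    for t in $(seq 1 25); do                           # ~25 type= dirs per hour
        d="$hourdir/type=$t"
        mkdir -p "$d"
        # one data file plus a small .crc per dir; sizes capped at ~10MB here
        # so a single tarball stays near the ~2.7GB mentioned in the thread
        size_kb=$(shuf -i 1-10240 -n 1)
        dd if=/dev/urandom of="$d/data-x.gz.parquet" bs=1K count="$size_kb" status=none
        dd if=/dev/urandom of="$d/data-x.gz.parquet.crc" bs=1K count=1 status=none
    done
done

tar cf "${epoch}.tar" "$root"

Generating on the order of 1300 of these (bumping the epoch each time) and untarring them in a loop on the FUSE mount, as in the script earlier in the thread, should approximate the reported workload.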