Hi gluster users,

I'm having an issue that I'm hoping to get some help with on a
dispersed volume (EC: 2x(4+2)) that's causing me some headaches. This is
on a cluster running Gluster 6.9 on CentOS 7.

At some point in the last week, writes to one of my bricks have started
failing due to a "No space left on device" error:

[2021-07-06 16:08:57.261307] E [MSGID: 115067]
[server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-gluster-01-server:
1853436561: WRITEV -2 (f2d6f2f8-4fd7-4692-bd60-23124897be54), client:
CTX_ID:648a7383-46c8-4ed7-a921-acafc90bec1a-GRAPH_ID:4-PID:19471-HOST:rhevh08.mgmt.triumf.ca-PC_NAME:gluster-01-client-5-RECON_NO:-5,
error-xlator: gluster-01-posix [No space left on device]

The disk is quite full (listed as 100% on the server), but does have
some writable room left:

/dev/mapper/vg--brick1-brick1   11T   11T   97G 100% /data/glusterfs/gluster-01/brick1

However, I'm not sure the amount of disk space used on the physical
drive is the true cause of the "No space left on device" errors anyway.
I can still manually write to this brick outside of Gluster, so it seems
like the operating system isn't preventing the writes from happening.

During my investigation, I noticed that one of the .glusterfs paths on the
problem server is using up much more space than it is on the other servers.
I can't quite figure out why that might be, or how it happened, and I'm
wondering if there's any advice on what the cause might have been.

I had done some package updates on the server with the issue and not on
the other servers. This included the kernel, but not the Gluster packages,
so possibly this, or the reboot to load the new kernel, caused a problem.
I have scripts on my Gluster machines to cleanly kill all of the brick
processes before rebooting, so I'm not leaning towards an abrupt shutdown
being the cause, but it's a possibility.

I'm also looking for advice on how to safely remove the problem file and
rebuild it from the other Gluster peers. I've seen some documentation on
this, but I'm a little nervous about corrupting the volume if I
misunderstand the process. I'm not free to take the volume or cluster down
and do maintenance at this point, but that might be something I'll have to
consider if it's my only option.

For reference, here's a comparison of the same path that seems to be
taking up extra space on one of the hosts:

1: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
2: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
3: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
4: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
5: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
6: 3.0T    /data/gluster-01/brick1/vol/.glusterfs/99/56

Any and all advice is appreciated.

Thanks!
--
Daniel Thomson
DevOps Engineer
t +1 604 222 7428
dthomson at triumf.ca
TRIUMF Canada's particle accelerator centre
www.triumf.ca @TRIUMFLab
4004 Wesbrook Mall
Vancouver BC V6T 2A3 Canada
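A quick way to compare space and inode usage of every brick from a single
node, assuming the volume is actually named gluster-01 (that is only
inferred from the "0-gluster-01-server" prefix in the log line above, so
adjust the name as needed), would be something like:

  # shows total/free disk space and free inodes for each brick of the volume
  gluster volume status gluster-01 detail

That makes it easy to spot a single brick that is far fuller than its peers
without logging into each server separately.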
On 06/07/2021 18:28, Dan Thomson wrote:
> The disk is quite full (listed as 100% on the server), but does have
> some writable room left [...] I can still manually write to this brick
> outside of Gluster, so it seems like the operating system isn't
> preventing the writes from happening.

Hi.

Maybe you're hitting the "reserved space for root" (usually 5%): when you
write from the server directly to the brick, you're most probably doing it
as root and using the reserved space. When you write from a client, you're
likely using a normal user and hit the "no space left" error.

Another possible issue to watch out for is inode exhaustion (I've been
bitten by it on an arbiter brick partition).

HIH,
Diego
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
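A minimal sketch for checking both of the conditions Diego mentions,
assuming the brick mount point from the original post and a volume name of
gluster-01 (inferred from the log prefix, so adjust as needed):

  # inode exhaustion also reports ENOSPC even when df shows free blocks
  df -i /data/glusterfs/gluster-01/brick1

  # Gluster itself reserves a percentage of each brick (storage.reserve);
  # once a brick falls below that threshold, writes through Gluster fail
  # with ENOSPC even though the filesystem still has free space
  gluster volume get gluster-01 storage.reserve

The 5% root reserve Diego refers to applies to ext filesystems (visible via
tune2fs -l as "Reserved block count"); an XFS brick, the CentOS 7 default,
has no equivalent root-only reserve, so on XFS the Gluster-level reserve is
the more likely explanation.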
Hi Dan,

On Mon, Jul 12, 2021 at 2:20 PM Dan Thomson <dthomson at triumf.ca> wrote:
> The disk is quite full (listed as 100% on the server), but does have
> some writable room left:
>
> /dev/mapper/vg--brick1-brick1   11T   11T   97G 100% /data/glusterfs/gluster-01/brick1
>
> however, I'm not sure if the amount of disk space used on the physical
> drive is the true cause of the "No Space Left on Device" errors anyway.
> I can still manually write to this brick outside of Gluster, so it seems
> like the operating system isn't preventing the writes from happening.

As Strahil has said, you are probably hitting the minimum space reserved
by Gluster. You can try those options. However, I don't recommend keeping
bricks above 90% utilization. All filesystems, including XFS, tend to
degrade in performance when available space is limited, and if the brick's
filesystem performs worse, Gluster performance will also drop.

> For reference, here's the comparison of the same path that seems to be
> taking up extra space on one of the hosts:
>
> 1: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
> 2: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
> 3: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
> 4: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
> 5: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
> 6: 3.0T    /data/gluster-01/brick1/vol/.glusterfs/99/56

This is not normal at all. In a dispersed volume all bricks should use
roughly the same amount of space.
Can you provide the output of the following commands?

# gluster volume info <volname>
# gluster volume status <volname>

Also provide the output of this command from all bricks:

# ls -ls /data/gluster-01/brick1/vol/.glusterfs/99/56

Regards,
Xavi
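On the question of identifying what is eating the extra 3.0T: for regular
files, each entry under .glusterfs/xx/yy is a hard link to the real file on
the same brick, so the inode can be used to map a GFID entry back to its
user-visible path. A rough sketch, using the brick path from the thread;
<gfid> is a placeholder to be replaced with an actual entry name:

  BRICK=/data/gluster-01/brick1/vol

  # largest entries in the directory that is 3.0T on server 6
  du -ah "$BRICK/.glusterfs/99/56" | sort -h | tail -n 10

  # map one GFID entry back to its real path by matching the inode,
  # skipping the .glusterfs tree itself
  GFID_FILE="$BRICK/.glusterfs/99/56/<gfid>"   # placeholder, substitute a real name
  find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -samefile "$GFID_FILE" -print

That only identifies the files; before removing any GFID entries by hand it
is worth checking whether the volume already has pending heals
(# gluster volume heal <volname> info) and confirming the removal procedure,
as Dan notes, since a mistake there can leave the volume inconsistent.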