Micha Ober
2016-Oct-01 10:44 UTC
[Gluster-users] Some files are not healed (but not in split-brain), manual fix with setfattr does not work
Hi all,

I noticed that I have two files which are not healed:

root at giant5:~# gluster volume heal gv0 info
Gathering Heal info on volume gv0 has been successful

Brick giant1:/gluster/sdc/gv0
Number of entries: 1
/holicki/lqcd/slurm-7251.out

Brick giant2:/gluster/sdc/gv0
Number of entries: 1
/holicki/lqcd/slurm-7251.out

Brick giant3:/gluster/sdc/gv0
Number of entries: 1
/holicki/lqcd/slurm-7249.out

Brick giant4:/gluster/sdc/gv0
Number of entries: 1
/holicki/lqcd/slurm-7249.out

Brick giant5:/gluster/sdc/gv0
Number of entries: 1
<gfid:e9793d5e-7174-49b0-9fa9-90f8c35948e7>

Brick giant6:/gluster/sdc/gv0
Number of entries: 1
<gfid:e9793d5e-7174-49b0-9fa9-90f8c35948e7>

Brick giant1:/gluster/sdd/gv0
Number of entries: 1
/jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu1.000/slurm-5660.out

Brick giant2:/gluster/sdd/gv0
Number of entries: 1
/jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu1.000/slurm-5660.out

Brick giant3:/gluster/sdd/gv0
Number of entries: 0

Brick giant4:/gluster/sdd/gv0
Number of entries: 0

Brick giant5:/gluster/sdd/gv0
Number of entries: 0

Brick giant6:/gluster/sdd/gv0
Number of entries: 0

(Disregard the file "slurm-7251.out"; that one is/was I/O in progress.)

The logs are filled with entries like this:

[2016-09-30 12:45:26.611375] I [afr-self-heal-data.c:655:afr_sh_data_fix] 0-gv0-replicate-3: no active sinks for performing self-heal on file /jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu1.000/slurm-5660.out
[2016-09-30 12:45:36.874802] I [afr-self-heal-data.c:655:afr_sh_data_fix] 0-gv0-replicate-3: no active sinks for performing self-heal on file /jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu1.000/slurm-5660.out
[2016-09-30 12:45:53.701884] I [afr-self-heal-data.c:655:afr_sh_data_fix] 0-gv0-replicate-3: no active sinks for performing self-heal on file /jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu1.000/slurm-5660.out

I checked with md5sum that both copies of the file are identical. Then I used setfattr as proposed in an older thread on this mailing list:

setfattr -n trusted.afr.gv0-client-7 -v 0x000000000000000000000000 /gluster/sdd/gv0/jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu1.000/slurm-5660.out

I did this on both nodes for both clients, so it now looks like this (on both nodes/bricks):

getfattr -d -m . -e hex /gluster/sdd/gv0/jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu1.000/slurm-5660.out
getfattr: Removing leading '/' from absolute path names
# file: gluster/sdd/gv0/jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu1.000/slurm-5660.out
trusted.afr.gv0-client-6=0x000000000000000000000000
trusted.afr.gv0-client-7=0x000000000000000000000000
trusted.gfid=0xcb7978fa42e74a0b97928a87126338ac

I triggered a heal, but the files do not disappear from heal info. They are also not listed under split-brain or heal-failed.
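For completeness, this is essentially what I ran on each of the two bricks to clear the changelog (just a small loop around the setfattr call above; client-6 and client-7 are the client indices for this replica pair on my volume):

# Reset the AFR changelog xattrs for both clients of the replica pair.
# Run directly on each brick that holds a copy of the file.
FILE=/gluster/sdd/gv0/jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu1.000/slurm-5660.out
for client in trusted.afr.gv0-client-6 trusted.afr.gv0-client-7; do
    setfattr -n "$client" -v 0x000000000000000000000000 "$FILE"
done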
I used gfid-resolver.sh for the other file:

e9793d5e-7174-49b0-9fa9-90f8c35948e7 == File: /gluster/sdc/gv0/jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu0.800/slurm-5663.out

This file is also marked as dirty:

root at giant5:/var/log/glusterfs# getfattr -d -m . -e hex /gluster/sdc/gv0/jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu0.800/slurm-5663.out
getfattr: Removing leading '/' from absolute path names
# file: gluster/sdc/gv0/jwilhelm/CUDA/Ns16_Nt32_m0.0100_beta1.7000_lambda0.0050/mu0.800/slurm-5663.out
trusted.afr.gv0-client-4=0x000000010000000000000000
trusted.afr.gv0-client-5=0x000000010000000000000000
trusted.gfid=0xe9793d5e717449b09fa990f8c35948e7

(If I read the changelog format correctly, the xattr value is three 32-bit counters for data, metadata and entry operations, so the leading 0x00000001 means one pending data operation is recorded against each copy.)

How can I fix this, i.e. get the files healed? I'm using gluster 3.4.2 on Ubuntu 14.04.3.

I also thought about scheduling a downtime and upgrading gluster, but I don't know if I can do this as long as there are files to be healed.

Thanks for any advice.
Micha Ober
2016-Oct-02 09:57 UTC
[Gluster-users] Some files are not healed (but not in split-brain), manual fix with setfattr does not work
I noticed one more detail, since I found some entries which are not healed on another gluster filesystem as well: after resolving the gfids, it turns out that *only* the log files generated by the SLURM workload manager (slurm-*.out) are affected. Are there any known problems with SLURM + glusterfs?
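For reference, one way to list the changelog xattrs of all SLURM log files directly on a brick (a minimal sketch against my brick path; .glusterfs is pruned so the gfid hardlinks are skipped):

# Show the AFR changelog xattrs of every slurm-*.out file on this brick.
find /gluster/sdc/gv0 -path '*/.glusterfs' -prune -o -name 'slurm-*.out' \
    -exec getfattr -d -m trusted.afr -e hex {} + 2>/dev/null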