B.K.Raghuram
2013-Nov-06  05:23 UTC
[Gluster-users] Re: Strange behaviour with add-brick followed by remove-brick
Here are the steps I did to reproduce the problem. Essentially, if you
try to remove a brick that is not on the localhost, the rebalance seems
to migrate the files off the localhost brick instead, and hence there is
a lot of data loss. If instead I remove the localhost brick, it works
fine. Can we try to get this fix into 3.4.2, as this seems to be the only
way to replace a brick, given that replace-brick is being removed?
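(For reference, the add-brick/remove-brick replacement workflow I am
referring to is roughly the following; these are the same commands used
in the reproduction below:)

gluster volume add-brick v1 s6n11.testing.lan:/data/v1
gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 start
# poll until the rebalance reports "completed" for the decommissioned brick
gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 status
gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 commit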
[root@s5n9 ~]# gluster volume create v1 transport tcp s5n9.testing.lan:/data/v1 s5n10.testing.lan:/data/v1
volume create: v1: success: please start the volume to access data
[root@s5n9 ~]# gluster volume start v1
volume start: v1: success
[root@s5n9 ~]# gluster volume info v1
Volume Name: v1
Type: Distribute
Volume ID: 6402b139-2957-4d62-810b-b70e6f9ba922
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: s5n9.testing.lan:/data/v1
Brick2: s5n10.testing.lan:/data/v1
*********** Then I NFS-mounted the volume onto my laptop and created 300
files in the mount with a script (a sketch of the script follows the
counts). Distribution results below ***********
[root@s5n9 ~]# ls -l /data/v1 | wc -l
160
[root@s5n10 ~]# ls -l /data/v1 | wc -l
142
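(The files were created with a small loop along these lines -- a rough
sketch, since the exact script is not included here; the NFS mount path
/mnt/v1 and the 1 KB file size are assumptions:)

#!/bin/bash
# Rough sketch of the file-creation script (mount path and sizes assumed).
MNT=/mnt/v1                      # NFS mount of volume v1 on the laptop
for i in $(seq 1 300); do
    dd if=/dev/urandom of=$MNT/file$i bs=1k count=1 2>/dev/null
done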
[root@s5n9 ~]# gluster volume add-brick v1 s6n11.testing.lan:/data/v1
volume add-brick: success
[root@s5n9 ~]# gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 start
volume remove-brick start: success
ID: 8f3c37d6-2f24-4418-b75a-751dcb6f2b98
[root@s5n9 ~]# gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 status
Node                 Rebalanced-files   size     scanned   failures   skipped   status        run-time in secs
---------            ----------------   ------   -------   --------   -------   -----------   ----------------
localhost            0                  0Bytes   0         0                    not started   0.00
s6n12.testing.lan    0                  0Bytes   0         0                    not started   0.00
s6n11.testing.lan    0                  0Bytes   0         0                    not started   0.00
s5n10.testing.lan    0                  0Bytes   300       0                    completed     1.00
[root@s5n9 ~]# gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success
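(In hindsight, a pre-commit sanity check along these lines would have
flagged the problem -- a minimal sketch, run on the node whose brick is
being removed; the .glusterfs exclusion assumes the usual brick layout:)

# Count regular files still left on the brick being decommissioned.
# This should be 0 (excluding the internal .glusterfs directory)
# before issuing "remove-brick ... commit".
find /data/v1 -path /data/v1/.glusterfs -prune -o -type f -print | wc -l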
[root@s5n9 ~]# gluster volume info v1
Volume Name: v1
Type: Distribute
Volume ID: 6402b139-2957-4d62-810b-b70e6f9ba922
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: s5n9.testing.lan:/data/v1
Brick2: s6n11.testing.lan:/data/v1
[root@s5n9 ~]# ls -l /data/v1 | wc -l
160
[root@s5n10 ~]# ls -l /data/v1 | wc -l
142
[root@s6n11 ~]# ls -l /data/v1 | wc -l
160
[root@s5n9 ~]# ls /data/v1
file10   file110  file131  file144  file156  file173  file19   file206
 file224  file238  file250  file264  file279  file291  file31  file44
file62  file86
file100  file114  file132  file146  file159  file174  file192  file209
 file225  file24   file252  file265  file28   file292  file32  file46
file63  file87
file101  file116  file134  file147  file16   file18   file196  file210
 file228  file240  file254  file266  file281  file293  file37  file47
file66  file9
file102  file12   file135  file148  file161  file181  file198  file212
 file229  file241  file255  file267  file284  file294  file38  file48
file69  file91
file103  file121  file136  file149  file165  file183  file200  file215
 file231  file243  file256  file268  file285  file295  file4   file50
file7   file93
file104  file122  file137  file150  file17   file184  file201  file216
 file233  file245  file258  file271  file286  file296  file40  file53
file71  file97
file105  file124  file138  file152  file170  file186  file202  file218
 file234  file246  file261  file273  file287  file297  file41  file54
file73
file107  file125  file140  file153  file171  file188  file203  file220
 file236  file248  file262  file275  file288  file298  file42  file55
file75
file11   file13   file141  file154  file172  file189  file204  file222
 file237  file25   file263  file278  file290  file3    file43  file58
file80
[root@s6n11 ~]# ls /data/v1
file10   file110  file131  file144  file156  file173  file19   file206
 file224  file238  file250  file264  file279  file291  file31  file44
file62  file86
file100  file114  file132  file146  file159  file174  file192  file209
 file225  file24   file252  file265  file28   file292  file32  file46
file63  file87
file101  file116  file134  file147  file16   file18   file196  file210
 file228  file240  file254  file266  file281  file293  file37  file47
file66  file9
file102  file12   file135  file148  file161  file181  file198  file212
 file229  file241  file255  file267  file284  file294  file38  file48
file69  file91
file103  file121  file136  file149  file165  file183  file200  file215
 file231  file243  file256  file268  file285  file295  file4   file50
file7   file93
file104  file122  file137  file150  file17   file184  file201  file216
 file233  file245  file258  file271  file286  file296  file40  file53
file71  file97
file105  file124  file138  file152  file170  file186  file202  file218
 file234  file246  file261  file273  file287  file297  file41  file54
file73
file107  file125  file140  file153  file171  file188  file203  file220
 file236  file248  file262  file275  file288  file298  file42  file55
file75
file11   file13   file141  file154  file172  file189  file204  file222
 file237  file25   file263  file278  file290  file3    file43  file58
file80
******* An ls of the mountpoint after this whole process shows only
159 files - the ones that are on s5n9. So everything that was on s5n10
is gone!! *******
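(To see exactly which files the volume no longer exposes, one can diff
the client mount against the bricks -- a rough sketch, assuming the
client mount is at /mnt/v1 and passwordless ssh to the storage nodes:)

#!/bin/bash
# List files visible on the client mount and on each brick, then report
# the files that exist on a brick but are no longer visible on the mount.
ls /mnt/v1 | sort > /tmp/mount.lst
for host in s5n9.testing.lan s5n10.testing.lan s6n11.testing.lan; do
    ssh root@$host 'ls /data/v1' | sort > /tmp/$host.lst
done
sort -u /tmp/*.testing.lan.lst > /tmp/bricks.lst
comm -13 /tmp/mount.lst /tmp/bricks.lst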
Lalatendu Mohanty
2013-Nov-12  15:59 UTC
[Gluster-users] Re: Strange behaviour with add-brick followed by remove-brick
On 11/06/2013 10:53 AM, B.K.Raghuram wrote:
> Here are the steps I did to reproduce the problem. Essentially, if you
> try to remove a brick that is not on the localhost, the rebalance seems
> to migrate the files off the localhost brick instead, and hence there is
> a lot of data loss. If instead I remove the localhost brick, it works
> fine. Can we try to get this fix into 3.4.2, as this seems to be the only
> way to replace a brick, given that replace-brick is being removed?
> [...]

This matches the description in bug
https://bugzilla.redhat.com/show_bug.cgi?id=1024369. Also, from the bug
comments I can see it is confirmed that the issue is not present in
upstream master, but we need to back-port the fix/fixes to the 3.4 branch.

-Lala