B.K.Raghuram
2013-Nov-06 05:23 UTC
[Gluster-users] Re; Strange behaviour with add-brick followed by remove-brick
Here are the steps that I did to reproduce the problem. Essentially, if you try to remove a brick that is not the one on the localhost, then it seems to migrate the files on the localhost brick instead, and hence there is a lot of data loss. If instead I try to remove the localhost brick, it works fine. Can we try and get this fix into 3.4.2, as this seems to be the only way to replace a brick, given that replace-brick is being removed!

[root at s5n9 ~]# gluster volume create v1 transport tcp s5n9.testing.lan:/data/v1 s5n10.testing.lan:/data/v1
volume create: v1: success: please start the volume to access data
[root at s5n9 ~]# gluster volume start v1
volume start: v1: success
[root at s5n9 ~]# gluster volume info v1

Volume Name: v1
Type: Distribute
Volume ID: 6402b139-2957-4d62-810b-b70e6f9ba922
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: s5n9.testing.lan:/data/v1
Brick2: s5n10.testing.lan:/data/v1

*********** Now NFS-mounted the volume onto my laptop and, with a script, created 300 files in the mount. Distribution results below ***********
[root at s5n9 ~]# ls -l /data/v1 | wc -l
160
[root at s5n10 ~]# ls -l /data/v1 | wc -l
142

[root at s5n9 ~]# gluster volume add-brick v1 s6n11.testing.lan:/data/v1
volume add-brick: success
[root at s5n9 ~]# gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 start
volume remove-brick start: success
ID: 8f3c37d6-2f24-4418-b75a-751dcb6f2b98
[root at s5n9 ~]# gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 status
                 Node   Rebalanced-files     size   scanned   failures   skipped        status   run-time in secs
            ---------   ----------------   ------   -------   --------   -------   -----------   ----------------
            localhost                  0   0Bytes         0          0              not started               0.00
    s6n12.testing.lan                  0   0Bytes         0          0              not started               0.00
    s6n11.testing.lan                  0   0Bytes         0          0              not started               0.00
    s5n10.testing.lan                  0   0Bytes       300          0                completed               1.00

[root at s5n9 ~]# gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

[root at s5n9 ~]# gluster volume info v1

Volume Name: v1
Type: Distribute
Volume ID: 6402b139-2957-4d62-810b-b70e6f9ba922
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: s5n9.testing.lan:/data/v1
Brick2: s6n11.testing.lan:/data/v1

[root at s5n9 ~]# ls -l /data/v1 | wc -l
160
[root at s5n10 ~]# ls -l /data/v1 | wc -l
142
[root at s6n11 ~]# ls -l /data/v1 | wc -l
160

[root at s5n9 ~]# ls /data/v1
file10   file110  file131  file144  file156  file173  file19   file206  file224  file238  file250  file264  file279  file291  file31   file44   file62   file86
file100  file114  file132  file146  file159  file174  file192  file209  file225  file24   file252  file265  file28   file292  file32   file46   file63   file87
file101  file116  file134  file147  file16   file18   file196  file210  file228  file240  file254  file266  file281  file293  file37   file47   file66   file9
file102  file12   file135  file148  file161  file181  file198  file212  file229  file241  file255  file267  file284  file294  file38   file48   file69   file91
file103  file121  file136  file149  file165  file183  file200  file215  file231  file243  file256  file268  file285  file295  file4    file50   file7    file93
file104  file122  file137  file150  file17   file184  file201  file216  file233  file245  file258  file271  file286  file296  file40   file53   file71   file97
file105  file124  file138  file152  file170  file186  file202  file218  file234  file246  file261  file273  file287  file297  file41   file54   file73
file107  file125  file140  file153  file171  file188  file203  file220  file236  file248  file262  file275  file288  file298  file42   file55   file75
file11   file13   file141  file154  file172  file189  file204  file222  file237  file25   file263  file278  file290  file3    file43   file58   file80

[root at s6n11 ~]# ls /data/v1
file10   file110  file131  file144  file156  file173  file19   file206  file224  file238  file250  file264  file279  file291  file31   file44   file62   file86
file100  file114  file132  file146  file159  file174  file192  file209  file225  file24   file252  file265  file28   file292  file32   file46   file63   file87
file101  file116  file134  file147  file16   file18   file196  file210  file228  file240  file254  file266  file281  file293  file37   file47   file66   file9
file102  file12   file135  file148  file161  file181  file198  file212  file229  file241  file255  file267  file284  file294  file38   file48   file69   file91
file103  file121  file136  file149  file165  file183  file200  file215  file231  file243  file256  file268  file285  file295  file4    file50   file7    file93
file104  file122  file137  file150  file17   file184  file201  file216  file233  file245  file258  file271  file286  file296  file40   file53   file71   file97
file105  file124  file138  file152  file170  file186  file202  file218  file234  file246  file261  file273  file287  file297  file41   file54   file73
file107  file125  file140  file153  file171  file188  file203  file220  file236  file248  file262  file275  file288  file298  file42   file55   file75
file11   file13   file141  file154  file172  file189  file204  file222  file237  file25   file263  file278  file290  file3    file43   file58   file80

******* An ls of the mountpoint after this whole process only shows 159 files - the ones that are on s5n9. So everything that was on s5n10 is gone!! *******
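For reference, here is a minimal sketch of the same decommission flow with a sanity check before the commit. It is only a sketch: it assumes the hostnames and brick path used above, that it is run from s5n9, and that passwordless ssh to s5n10 is available. The idea is that once remove-brick migration has genuinely completed, the decommissioned brick should hold no data files, so the script refuses to commit if anything is still left on it.

    gluster volume add-brick v1 s6n11.testing.lan:/data/v1
    gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 start

    # Poll until the node hosting the brick being removed reports "completed".
    until gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 status \
          | grep s5n10.testing.lan | grep -q completed; do
        sleep 5
    done

    # The decommissioned brick should now be empty of data files; if files are
    # still sitting there, committing will drop them from the volume.
    LEFT=$(ssh root@s5n10.testing.lan 'ls /data/v1 | wc -l')
    if [ "$LEFT" -eq 0 ]; then
        # commit still asks for y/n confirmation, as in the transcript above
        gluster volume remove-brick v1 s5n10.testing.lan:/data/v1 commit
    else
        echo "$LEFT files still left on s5n10:/data/v1 - not committing" >&2
    fi

In the run above, s5n10 still held roughly 141 files after its status line showed "completed", so a check along these lines would have refused the commit and left the data reachable on the old brick.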
Lalatendu Mohanty
2013-Nov-12 15:59 UTC
[Gluster-users] Re; Strange behaviour with add-brick followed by remove-brick
On 11/06/2013 10:53 AM, B.K.Raghuram wrote:
> Here are the steps that I did to reproduce the problem. Essentially,
> if you try to remove a brick that is not the same as the localhost
> then it seems to migrate the files on the localhost brick instead and
> hence there is a lot of data loss. If instead, I try to remove the
> localhost brick, it works fine. Can we try and get this fix into 3.4.2
> as this seems to be the only way to replace a brick, given that
> replace-brick is being removed!
> [...]

This matches the description in bug https://bugzilla.redhat.com/show_bug.cgi?id=1024369. In the bug comments I can also see it is confirmed that the issue is not present in upstream master, but we need to back-port the fix(es) to the 3.4 branch.

-Lala
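For anyone who has already run into this on 3.4.x, a rough way to see what went missing is to compare a listing taken through the mount with a listing of the brick that was removed. This sketch assumes the volume is mounted at /mnt/v1 on the client (the mount point is not named in the thread), that the removed brick's directory on s5n10 has not been wiped, and that passwordless ssh is available:

    # Listing as the clients see it vs. what is still sitting on the removed brick.
    ls /mnt/v1 | sort > /tmp/on_mount.txt
    ssh root@s5n10.testing.lan 'ls /data/v1' | sort > /tmp/on_removed_brick.txt

    # Names present only in the second listing were never migrated and are no
    # longer visible on the volume.
    comm -13 /tmp/on_mount.txt /tmp/on_removed_brick.txt

On a plain distribute volume those leftover names are ordinary files under /data/v1 on s5n10, so it should be possible to recover them by copying them back in through a client mount.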