thr3ads.net - CentOS - [CentOS] drbd [Oct 2014]

John R Pierce

2014-Oct-12 06:30 UTC

[CentOS] drbd

so I've had a drbd replica running for a while of a 16TB raid that's used
as a backuppc repository.

when I have rebooted the backuppc server, the replica doesn't seem to
auto-restart until I do it manually, and the backuppc /data file system on
this 16TB LUN doesn't seem to automount, either.

I've rebooted this thing a few times in the 18 months or so it's been
running...   not always cleanly...

anyways, I started a drbd verify (from the slave) about 10 hours ago,
it has 15 hours more to run, and so far it's logged...

Oct 11 13:58:26 sg2 kernel: block drbd0: Starting Online Verify from sector 3534084704
Oct 11 14:00:23 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967295
Oct 11 14:00:29 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967294
Oct 11 14:00:35 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967293
Oct 11 14:00:41 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967292
Oct 11 14:01:16 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967295
Oct 11 14:02:05 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967295
Oct 11 14:02:11 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967294
Oct 11 14:02:17 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967293
Oct 11 14:33:41 sg2 kernel: block drbd0: Out of sync: start=3932979480, size=8 (sectors)
Oct 11 14:34:46 sg2 kernel: block drbd0: Out of sync: start=3946056120, size=8 (sectors)
Oct 11 15:37:07 sg2 kernel: block drbd0: Out of sync: start=4696809024, size=8 (sectors)
Oct 11 17:08:15 sg2 kernel: block drbd0: Out of sync: start=6084949528, size=8 (sectors)
Oct 11 17:30:53 sg2 kernel: block drbd0: Out of sync: start=6567543472, size=8 (sectors)
Oct 11 17:59:04 sg2 kernel: block drbd0: Out of sync: start=7169767896, size=8 (sectors)
Oct 11 20:00:50 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967295
Oct 11 20:01:09 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967295
Oct 11 20:01:15 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967294
Oct 11 20:01:29 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967295
Oct 11 20:29:18 sg2 kernel: block drbd0: Out of sync: start=10362907296, size=8 (sectors)
Oct 11 20:29:54 sg2 kernel: block drbd0: Out of sync: start=10375790488, size=8 (sectors)
Oct 11 21:01:51 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967295
Oct 11 21:42:15 sg2 kernel: block drbd0: Out of sync: start=11907921096, size=8 (sectors)
Oct 11 21:43:38 sg2 kernel: block drbd0: Out of sync: start=11937086248, size=8 (sectors)
Oct 11 21:44:00 sg2 kernel: block drbd0: Out of sync: start=11944705032, size=8 (sectors)
Oct 11 21:49:26 sg2 kernel: block drbd0: Out of sync: start=12062270432, size=8 (sectors)
Oct 11 22:07:10 sg2 kernel: block drbd0: Out of sync: start=12440235128, size=8 (sectors)
Oct 11 22:58:54 sg2 kernel: block drbd0: Out of sync: start=13548501984, size=8 (sectors)
Oct 11 23:23:17 sg2 kernel: block drbd0: Out of sync: start=14072873320, size=8 (sectors)
$ date
Sat Oct 11 23:28:11 PDT 2014

it's 35% done at this point...   15 out-of-sync 4K blocks in the first
third of 16TB isn't a lot, but it's still more than I like to see.
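
The count above can be checked mechanically rather than by eye. The following is a minimal Python sketch, not anything from the original post: the function name `summarize_out_of_sync` is made up for illustration, and it assumes the "size=8 (sectors)" figures are 512-byte sectors, which is consistent with the "4K blocks" mentioned above (8 x 512 = 4096 bytes). Applied to the full log, 15 ranges of 8 sectors each come to 60 KiB.

```python
import re

# Matches DRBD kernel messages such as:
#   block drbd0: Out of sync: start=3932979480, size=8 (sectors)
OOS_RE = re.compile(r"Out of sync: start=(\d+),\s*size=(\d+) \(sectors\)")

SECTOR_BYTES = 512  # assumption: the log counts 512-byte sectors


def summarize_out_of_sync(log_text):
    """Return (range_count, total_sectors, total_kib) for a log excerpt."""
    # Collapse whitespace first, in case a mail client wrapped the
    # messages between "start=...," and "size=...".
    flat = " ".join(log_text.split())
    ranges = [(int(start), int(size)) for start, size in OOS_RE.findall(flat)]
    total_sectors = sum(size for _, size in ranges)
    total_kib = total_sectors * SECTOR_BYTES // 1024
    return len(ranges), total_sectors, total_kib


# Two entries copied from the log in this post.
sample = """
Oct 11 14:33:41 sg2 kernel: block drbd0: Out of sync: start=3932979480,
size=8 (sectors)
Oct 11 14:34:46 sg2 kernel: block drbd0: Out of sync: start=3946056120,
size=8 (sectors)
"""
print(summarize_out_of_sync(sample))  # (2, 16, 8)
```

Feeding it the whole of /var/log/messages (or wherever the kernel log lands on a given box) gives the running total while the verify is still in progress.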

$ cat /proc/drbd
version: 8.3.15 (api:88/proto:86-97)
GIT-hash: 0ce4d235fc02b5c53c1c52c53433d11a694eab8c build by phil at Build64R6, 2012-12-20 20:09:51
  0: cs:VerifyS ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
     ns:0 nr:105707 dw:187685496 dr:654444832 al:0 bm:1 lo:107 pe:2104 ua:435 ap:0 ep:1 wo:f oos:60
         [=====>..............] verified: 34.6% (9846140/15051076)M
         finish: 14:55:27 speed: 187,648 (155,708) want: 204,800 K/sec
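
The numbers in that status output hang together, and a short sketch makes the arithmetic explicit. This assumes the units DRBD 8.3 documents for /proc/drbd: `oos` in KiB, the "(left/total)M" pair in MiB, and `speed` in KiB per second. The helper names `parse_verify_progress` and `estimate` are hypothetical, and the 4 KiB block size is taken from the log messages rather than from DRBD itself.

```python
import re


def parse_verify_progress(proc_text):
    """Pull the verify counters out of a /proc/drbd excerpt."""
    flat = " ".join(proc_text.split())  # tolerate wrapped lines
    oos_kib = int(re.search(r"\boos:(\d+)", flat).group(1))
    left_mib, total_mib = map(
        int, re.search(r"\((\d+)/(\d+)\)M", flat).groups()
    )
    speed_kib_s = int(
        re.search(r"speed: ([\d,]+)", flat).group(1).replace(",", "")
    )
    return oos_kib, left_mib, total_mib, speed_kib_s


def estimate(proc_text, block_kib=4):
    """Derive block count, percent verified and hours remaining."""
    oos_kib, left_mib, total_mib, speed = parse_verify_progress(proc_text)
    done_pct = 100.0 * (total_mib - left_mib) / total_mib
    seconds_left = left_mib * 1024 / speed  # MiB -> KiB, then KiB / (KiB/s)
    return {
        "blocks_out_of_sync": oos_kib // block_kib,
        "done_pct": round(done_pct, 1),
        "hours_left": round(seconds_left / 3600, 2),
    }


# Status lines copied from the output in this post.
status = """
 0: cs:VerifyS ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:105707 dw:187685496 dr:654444832 al:0 bm:1 lo:107 pe:2104
ua:435 ap:0 ep:1 wo:f oos:60
        [=====>..............] verified: 34.6% (9846140/15051076)M
        finish: 14:55:27 speed: 187,648 (155,708) want: 204,800 K/sec
"""
print(estimate(status))
```

Under those assumptions, `oos:60` works out to the same 15 blocks of 4 KiB counted in the kernel log, and the remaining 9846140 MiB at 187,648 KiB/s comes to a little under 15 hours, in line with the `finish:` field.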




really, if I let this complete, then disconnect/reconnect the replica,
will it repair these glitches?   I'm gathering I should schedule these
verifies weekly or something.
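
For reference, the DRBD 8.3 User's Guide describes the procedure asked about here: online verify only marks the mismatched blocks, and a `drbdadm disconnect` followed by `drbdadm connect` once the verify has finished makes DRBD resynchronize them. The sketch below wraps those two commands in Python. The resource name `r0` is a placeholder (the thread never names the resource), the function names are hypothetical, and it defaults to a dry run that only prints what it would execute.

```python
import subprocess


def resync_commands(resource):
    """Commands that reconnect a resource so DRBD resyncs marked blocks."""
    return [
        ["drbdadm", "disconnect", resource],
        ["drbdadm", "connect", resource],
    ]


def run_resync(resource, dry_run=True):
    """Print the commands, and execute them only when dry_run is False."""
    for argv in resync_commands(resource):
        print(" ".join(argv))
        if not dry_run:
            # check=True stops after a failed disconnect instead of
            # blindly issuing the connect.
            subprocess.run(argv, check=True)


# "r0" is a placeholder resource name, not taken from the thread.
run_resync("r0")
```

This should only be run after the verify has completed; calling `run_resync("r0", dry_run=False)` needs root and a working drbdadm on the node.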




-- 
john r pierce                                      37N 122W
somewhere on the middle of the left coast

John R Pierce

2014-Oct-12 08:07 UTC

[CentOS] drbd

On 10/11/2014 11:30 PM, John R Pierce wrote:
> so I've had a drbd replica running for a while of a 16TB raid thats
> used as a backuppc repository.

oh.   this is running on a pair of centos 6.latest boxes, each dual xeon 
x5650 w/ 48GB ram, with LSI SAS2 raid card hooked up to a whole lotta 
sas/sata drives.



-- 
john r pierce                                      37N 122W
somewhere on the middle of the left coast

Digimer

2014-Oct-12 16:30 UTC

[CentOS] drbd

On 12/10/14 02:30 AM, John R Pierce wrote:
> so I've had a drbd replica running for a while of a 16TB raid thats used
> as a backuppc repository.
>
> when I have rebooted the backuppc server, the replica doesn't seem to
> auto-restart til I do it manually, and the backupc /data file system on
> this 16TB LUN doesn't seem to automount, either.
>
> I've rebooted this thing a few times in the 18 months or so its been
> running...   not always cleanly...
>
> anyways, I'm started a drbd verify (from the slave) about 10 hours ago,
> it has 15 hours more to run, and so far it's logged...
>
> Oct 11 13:58:26 sg2 kernel: block drbd0: Starting Online Verify from
> sector 3534084704
> Oct 11 14:00:23 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 14:00:29 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967294
> Oct 11 14:00:35 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967293
> Oct 11 14:00:41 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967292
> Oct 11 14:01:16 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 14:02:05 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 14:02:11 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967294
> Oct 11 14:02:17 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967293
> Oct 11 14:33:41 sg2 kernel: block drbd0: Out of sync: start=3932979480,
> size=8 (sectors)
> Oct 11 14:34:46 sg2 kernel: block drbd0: Out of sync: start=3946056120,
> size=8 (sectors)
> Oct 11 15:37:07 sg2 kernel: block drbd0: Out of sync: start=4696809024,
> size=8 (sectors)
> Oct 11 17:08:15 sg2 kernel: block drbd0: Out of sync: start=6084949528,
> size=8 (sectors)
> Oct 11 17:30:53 sg2 kernel: block drbd0: Out of sync: start=6567543472,
> size=8 (sectors)
> Oct 11 17:59:04 sg2 kernel: block drbd0: Out of sync: start=7169767896,
> size=8 (sectors)
> Oct 11 20:00:50 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 20:01:09 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 20:01:15 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967294
> Oct 11 20:01:29 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 20:29:18 sg2 kernel: block drbd0: Out of sync: start=10362907296,
> size=8 (sectors)
> Oct 11 20:29:54 sg2 kernel: block drbd0: Out of sync: start=10375790488,
> size=8 (sectors)
> Oct 11 21:01:51 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 21:42:15 sg2 kernel: block drbd0: Out of sync: start=11907921096,
> size=8 (sectors)
> Oct 11 21:43:38 sg2 kernel: block drbd0: Out of sync: start=11937086248,
> size=8 (sectors)
> Oct 11 21:44:00 sg2 kernel: block drbd0: Out of sync: start=11944705032,
> size=8 (sectors)
> Oct 11 21:49:26 sg2 kernel: block drbd0: Out of sync: start=12062270432,
> size=8 (sectors)
> Oct 11 22:07:10 sg2 kernel: block drbd0: Out of sync: start=12440235128,
> size=8 (sectors)
> Oct 11 22:58:54 sg2 kernel: block drbd0: Out of sync: start=13548501984,
> size=8 (sectors)
> Oct 11 23:23:17 sg2 kernel: block drbd0: Out of sync: start=14072873320,
> size=8 (sectors)
> $ date
> Sat Oct 11 23:28:11 PDT 2014
>
> its 35% done at this point...   15 4K blocks out wrong of 1/3rd of 16TB
> isn't a lot, but its still more than I like to see.
>
> $ cat /proc/drbd
> version: 8.3.15 (api:88/proto:86-97)
> GIT-hash: 0ce4d235fc02b5c53c1c52c53433d11a694eab8c build by
> phil at Build64R6, 2012-12-20 20:09:51
>   0: cs:VerifyS ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
>      ns:0 nr:105707 dw:187685496 dr:654444832 al:0 bm:1 lo:107 pe:2104
> ua:435 ap:0 ep:1 wo:f oos:60
>          [=====>..............] verified: 34.6% (9846140/15051076)M
>          finish: 14:55:27 speed: 187,648 (155,708) want: 204,800 K/sec
>
>
>
>
> really, if I let this complete, then disconnect/reconnect the replica,
> it will repair these glitches ?   I'm gathering I shoudl schedule these
> verifies weekly or something.

That the backing device of one node fell out of sync is a cause for concern.
A "weekly" scan might be a bit much, but monthly or so isn't unreasonable.
Of course, as you're seeing here, it's a lengthy process and it consumes
non-trivial amounts of bandwidth and adds a fair load to the disks.
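
A monthly verify is usually scheduled from cron. As a rough sketch only, the helper below builds an /etc/cron.d style line that starts the verify once a month at a quiet hour. The function name, the resource name `r0`, the chosen time, and the `/sbin/drbdadm` path are all assumptions to adjust for the actual system.

```python
def verify_cron_line(resource, day_of_month=1, hour=2, minute=30):
    """Build an /etc/cron.d style entry that starts a verify once a month."""
    if not 1 <= day_of_month <= 28:
        raise ValueError("pick a day that exists in every month (1-28)")
    # Field order: minute hour day-of-month month day-of-week user command
    return (
        f"{minute} {hour} {day_of_month} * * "
        f"root /sbin/drbdadm verify {resource}"
    )


# "r0" is a placeholder resource name, not taken from the thread.
print(verify_cron_line("r0"))
# 30 2 1 * * root /sbin/drbdadm verify r0
```

Since a full pass over this volume takes roughly a day, the start time matters less than making sure two runs never overlap.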

How long was it in production before this verify?

I can't speak to backuppc, but I am curious how you're managing the 
resources. Are you using cman + rgmanager or pacemaker?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
