I recently used Heartbeat to fail over resources so that I could power down a Lustre node to add more RAM, then failed back to do the same on our second Lustre node. Only afterwards did I find that our Lustre install is missing a physical volume from LVM: pvscan shows only three of the four partitions. Any hints? I've tried some recovery steps in LVM, running pvcreate with the archived config for the missing PV, but no luck; it says there is no device with that UUID. I'm lost on what to do now. This is Lustre 1.8.4.
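For readers following the recovery attempt described above, here is a minimal Python sketch of how the PV UUIDs recorded in an LVM metadata archive could be listed and compared with what a scan currently sees. The volume group name, UUIDs, device paths, and archive path are all invented for illustration; only the general layout of the archive file and the `pvcreate --uuid ... --restorefile ...` form are taken from LVM itself. The `device` field in an archive is only a hint, so the command string is built for review and nothing is executed.

```python
import re

# Hypothetical excerpt in the layout LVM uses for /etc/lvm/archive/*.vg files.
# The VG name, UUIDs, and device paths are invented for illustration.
SAMPLE_ARCHIVE = '''
lustre_vg {
    physical_volumes {
        pv0 {
            id = "AAAAAA-1111-2222-3333-4444-5555-AAAAAA"
            device = "/dev/sdb1"    # Hint only
        }
        pv1 {
            id = "BBBBBB-1111-2222-3333-4444-5555-BBBBBB"
            device = "/dev/sdc1"    # Hint only
        }
    }
}
'''

# Matches a "pvN { id = "..." device = "..."" group inside physical_volumes.
PV_BLOCK = re.compile(
    r'(pv\d+)\s*\{\s*id\s*=\s*"([^"]+)"\s*device\s*=\s*"([^"]+)"'
)


def archived_pvs(text):
    """Return {uuid: device_hint} for every PV recorded in the archive text."""
    return {uuid: dev for _name, uuid, dev in PV_BLOCK.findall(text)}


def missing_pvs(archived, visible_uuids):
    """Return the archived PVs whose UUID is not among the visible ones."""
    return {u: d for u, d in archived.items() if u not in visible_uuids}


def restore_hint(uuid, device, archive_path):
    """Build a pvcreate command string for human review; nothing is run."""
    return f'pvcreate --uuid "{uuid}" --restorefile {archive_path} {device}'


if __name__ == "__main__":
    archived = archived_pvs(SAMPLE_ARCHIVE)
    # Assumption: a scan currently reports only the first UUID.
    visible = {"AAAAAA-1111-2222-3333-4444-5555-AAAAAA"}
    for uuid, dev in missing_pvs(archived, visible).items():
        # The archive path below is a placeholder, not a real file.
        print(restore_hint(uuid, dev, "/etc/lvm/archive/lustre_vg_00001.vg"))
```

Because the device recorded in the archive is a hint rather than a guarantee, it is worth confirming which block device really holds the data before relabeling anything.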
Does the device show up in /dev? Have you physically checked for Fibre/SAS connectivity, RAID controller errors, etc.?

You may need to supply more information about your setup. It sounds more like a RAID/disk issue than a Lustre issue.

----- Original Message -----
From: "David Noriega" <tsk133 at my.utsa.edu>
To: lustre-discuss at lists.lustre.org
Sent: Monday, 2 July, 2012 8:51:18 AM
Subject: [Lustre-discuss] Lustre missing physical volume

Only then do I find that now our lustre install is missing a physical volume out of lvm. pvscan only shows three out of four partitions.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Sorry for the rushed email. For some reason the LVM metadata got screwed up. I managed to restore it, though now I'm running into another issue. I've mounted the OSTs, yet it seems they are not all cooperating. One of the OSTs stays listed as Resource Unavailable, and this seems to be the main message on the OSS node:

LustreError: 137-5: UUID 'lustre-OST0002_UUID' is not available for connect (no target)
LustreError: Skipped 470 previous similar messages
LustreError: 5214:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8103ffc73400 x1404513746630678/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1341207057 ref 1 fl Interpret:/0/0 rc -19/0
LustreError: 5214:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 470 previous similar messages

I've tried remounting this OST on the other data node, but it still won't connect from the client side. I've even rebooted the MDS and still no go. I've run e2fsck to check the OSTs and found no issues. The disk arrays report no problems on their end, the fibre connections are good, and the multipath driver doesn't report anything. (These are Sun disk arrays, so we use the rdac driver instead of the basic multipath daemon.)

On the client side I see this:

Lustre: 3289:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1404591888147958 sent from lustre-OST0002-osc-ffff8104104ad800 to NID 192.168.5.101@tcp 0s ago has failed due to network error (30s prior to deadline).
  req@ffff81015113b400 x1404591888147958/t0 o8->lustre-OST0002_UUID@192.168.5.101@tcp:28/4 lens 368/584 e 0 to 1 dl 1341187631 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 3290:0:(import.c:517:import_select_connection()) lustre-OST0002-osc-ffff8104104ad800: tried all connections, increasing latency to 22s
Lustre: 3290:0:(import.c:517:import_select_connection()) Skipped 39 previous similar messages

On Sun, Jul 1, 2012 at 8:10 PM, Mark Day <mark.day at rsp.com.au> wrote:
> Does the device show up in /dev ?
> Have you physically checked for Fibre/SAS connectivity, RAID controller
> errors etc?
>
> You may need to supply more information about your setup. It sounds more
> like a RAID/disk issue than a Lustre issue.

--
David Noriega
CSBC/CBI System Administrator
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu
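As a reading aid for the OSS log above: Lustre reports failures as negated errno values, so "rc -19" corresponds to ENODEV ("No such device"), which agrees with the "no target" wording in the 137-5 message. The short Python sketch below shows one way to pull that field out of a log line and name it. The helper name `decode_rc` is made up for this example, and the regular expression assumes the `rc N/M` layout seen in the lines quoted in this thread.

```python
import errno
import re

# Matches the trailing "rc <status>/<status>" field of a Lustre request dump.
RC_FIELD = re.compile(r'\brc (-?\d+)/(-?\d+)')


def decode_rc(log_line):
    """Return (rc, errno_name) for the first rc field, or None if absent.

    errno_name is None when rc is not negative (no error reported).
    """
    match = RC_FIELD.search(log_line)
    if match is None:
        return None
    rc = int(match.group(1))
    if rc >= 0:
        return rc, None
    # Negative return codes are negated errno values.
    return rc, errno.errorcode.get(-rc, "UNKNOWN")


OSS_LINE = (
    "LustreError: 5214:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ "
    "processing error (-19) req@ffff8103ffc73400 x1404513746630678/t0 "
    "o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1341207057 ref 1 "
    "fl Interpret:/0/0 rc -19/0"
)

if __name__ == "__main__":
    print(decode_rc(OSS_LINE))
```

Read this way, the OSS is answering the connect request and reporting that it has no running target named lustre-OST0002, which points at the target's identity or mount state on the server rather than at the network path.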
What an adventure this turned into. It turns out that when I had to relabel the physical volumes, I got two of them backwards (I realized this when I checked /proc/fs/lustre/devices), and somehow this was tripping things up. I swapped them back using pvremove and pvcreate, remounted, and after a few minutes the clients reconnected and the system is happy again.

On Mon, Jul 2, 2012 at 12:42 AM, David Noriega <tsk133 at my.utsa.edu> wrote:
> Sorry for the rushed email. For some reason the LVM metadata got
> screwed up, managed to restore it, though now running into another
> issue. I've mounted the OSTs yet it seems they are not all
> cooperating. One of the OSTs will stay listed as Resource Unavailable
> and this seems to be the main message on the OSS node:
>
> LustreError: 137-5: UUID 'lustre-OST0002_UUID' is not available for
> connect (no target)

--
David Noriega
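The mix-up described in this last message, two physical volumes labeled backwards, can be checked mechanically before anything is remounted. The Python sketch below is one hedged way to do it: it compares an expected device-to-UUID mapping with the mapping parsed from text in the style of `pvs --noheadings -o pv_name,pv_uuid`. The device names, UUIDs, and both helper functions are invented for illustration, and the expected mapping is assumed to come from a source you trust, such as notes taken before the maintenance.

```python
def parse_pvs(output):
    """Parse two-column 'device uuid' lines into {device: uuid}."""
    mapping = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping


def find_swaps(expected, observed):
    """Return sorted device pairs whose UUIDs are exchanged.

    A pair (a, b) is reported when device a carries the UUID expected
    on b and device b carries the UUID expected on a.
    """
    swaps = []
    devices = sorted(expected)
    for i, a in enumerate(devices):
        for b in devices[i + 1:]:
            if observed.get(a) == expected[b] and observed.get(b) == expected[a]:
                swaps.append((a, b))
    return swaps


# Hypothetical data: what each device should carry ...
EXPECTED = {
    "/dev/sdb1": "AAAAAA-1111",
    "/dev/sdc1": "BBBBBB-2222",
    "/dev/sdd1": "CCCCCC-3333",
}

# ... and what a scan reports after the relabel (first two exchanged).
SAMPLE_PVS_OUTPUT = """
  /dev/sdb1 BBBBBB-2222
  /dev/sdc1 AAAAAA-1111
  /dev/sdd1 CCCCCC-3333
"""

if __name__ == "__main__":
    print(find_swaps(EXPECTED, parse_pvs(SAMPLE_PVS_OUTPUT)))
```

On multipath storage the device names themselves can change between boots, so in practice a stable identifier may be a better key than the /dev path used here.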