Sebastian Gutierrez
2008-Nov-13 19:45 UTC
[Lustre-discuss] Performance Issue Troubleshooting
Hello,

I am working with Lustre 1.6.5. A couple of weeks ago we had a move of the racks that the MDS and OSTs were in, and ever since the move our users have complained about performance issues. Most of the complaints have been about listing files and bash auto-completion, but the jobs running on the cluster seem to be running into issues as well. I have seen few Lustre errors in /var/log/messages apart from a few clients being evicted, and I have rebooted all the clients to try to remedy the performance issues.

On the OSS we have 4 Gig-E ports bonded, so I assume the bottleneck may be the MDS, which only has one Gig-E network uplink. I have inherited this system and am verifying the configuration as I go, so I may also be missing something fairly obvious. I apologize if I give too much information.

lfs check servers from mds_2 shows everything active:

    OSC_hoxa-mds-1.Stanford.EDU_ost1-cluster_MNT_cluster-ffff81022e1a6400 active.
    OSC_hoxa-mds-1.Stanford.EDU_ost2-cluster_MNT_cluster-ffff81022e1a6400 active.
    OSC_hoxa-mds-1.Stanford.EDU_ost3-cluster_MNT_cluster-ffff81022e1a6400 active.
    OSC_hoxa-mds-1.Stanford.EDU_ost4-cluster_MNT_cluster-ffff81022e1a6400 active.
    MDC_hoxa-mds-1.Stanford.EDU_mds-hoxa_MNT_cluster-ffff81022e1a6400 active.

From mds_1 the device list shows:

    0 UP mgs MGS MGS 25
    1 UP mgc MGC192.168.136.10@tcp c5642c46-5232-2004-4ba7-01a5ae11047f 5
    2 UP mdt MDS MDS_uuid 3
    3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
    4 UP mds lustre-MDT0000 mds-hoxa_UUID 91
    5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
    6 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5
    7 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
    8 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5

lctl device_list:

    0 UP mgc MGC192.168.136.10@tcp 5467a230-9a4b-41bc-dabf-5ae86ba7a287 5
    2 UP lov lov-cluster-ffff81022e1a6400 fd455bf8-544e-5f93-f316-270497373710 4
    3 UP osc OSC_hoxa-mds-1.Stanford.EDU_ost1-cluster_MNT_cluster-ffff81022e1a6400 fd455bf8-544e-5f93-f316-270497373710 5
    4 UP osc OSC_hoxa-mds-1.Stanford.EDU_ost2-cluster_MNT_cluster-ffff81022e1a6400 fd455bf8-544e-5f93-f316-270497373710 5
    5 UP osc OSC_hoxa-mds-1.Stanford.EDU_ost3-cluster_MNT_cluster-ffff81022e1a6400 fd455bf8-544e-5f93-f316-270497373710 5
    6 UP osc OSC_hoxa-mds-1.Stanford.EDU_ost4-cluster_MNT_cluster-ffff81022e1a6400 fd455bf8-544e-5f93-f316-270497373710 5
    7 UP mdc MDC_hoxa-mds-1.Stanford.EDU_mds-hoxa_MNT_cluster-ffff81022e1a6400 fd455bf8-544e-5f93-f316-270497373710 5

Running the following from a target:

    lfs getstripe /cluster/
    OBDS:
    0: ost1-cluster_UUID ACTIVE
    1: ost2-cluster_UUID ACTIVE
    2: ost3-cluster_UUID ACTIVE
    3: ost4-cluster_UUID ACTIVE
    /cluster/
    default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0

I have also verified that the subfolders have the same configuration.

So I started running llstat -i10 against different stats files on the OSS and the MDS while doing an ls that experiences the lockups, and I found that there seem to be quite a few requests with high wait times. I looked into the directory I was running the ls in, and it turns out the user has about 80,000 files in his directories, with file sizes around 1.5k.

Any pointers to track this down would be greatly appreciated.
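To get a feel for whether the slowness is the readdir itself or the per-file attribute lookups (as far as I understand, each stat() of an uncached file costs an MDS getattr plus an OST glimpse for the size, and a color-enabled ls stats entries much like ls -l does), one comparison I plan to try is timing a bare, unsorted listing against a listing that stats every entry. This is only a rough sketch, and the path is just a placeholder for the user's directory:

    time /bin/ls -U /cluster/path/to/dir > /dev/null     # readdir only: no sorting, no per-file stat
    time /bin/ls -lU /cluster/path/to/dir > /dev/null    # additionally issues one stat() per entry

If the second command is dramatically slower, the time is going into the per-file RPC round trips for those ~80,000 entries rather than into reading the directory itself.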
Below are two llstat samples of the MDS stats file. This first one was taken while the ls that locks up the box was running:

    /proc/fs/lustre/mdt/MDS/mds/stats @ 1226456419.648209
    Name                Cur.Count  Cur.Rate  #Events     Unit    last      min  avg     max      stddev
    req_waittime        34367      3436      1265403433  [usec]  428069    3    80.58   2068859  683.10
    req_qdepth          34367      3436      1265403433  [reqs]  3540      0    1.41    317      2.18
    req_active          34367      3436      1265403433  [reqs]  54803     1    10.12   127      8.66
    req_timeout         34368      3436      1265403434  [sec]   34368     1    3.26    169      7.82
    reqbuf_avail        70115      7011      2746703429  [bufs]  17929503  157  249.87  256      6.44
    ldlm_ibits_enqueue  34351      3435      1244343736  [reqs]  34351     1    1.00    1        0.00

This is without the ls that locks up the box:

    /proc/fs/lustre/mdt/MDS/mds/stats @ 1226456999.791520
    Name                Cur.Count  Cur.Rate  #Events     Unit    last      min  avg     max      stddev
    req_waittime        10664      1066      1266340431  [usec]  136698    3    80.54   2068859  682.85
    req_qdepth          10664      1066      1266340431  [reqs]  2162      0    1.40    317      2.18
    req_active          10664      1066      1266340431  [reqs]  14923     1    10.12   127      8.65
    req_timeout         10664      1066      1266340431  [sec]   10664     1    3.26    169      7.82
    reqbuf_avail        23857      2385      2748636467  [bufs]  6102353   157  249.87  256      6.44
    ldlm_ibits_enqueue  10647      1064      1245279702  [reqs]  10647     1    1.00    1        0.00
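Since my working theory is the single Gig-E uplink on the MDS, I am also going to watch the MDS request stats and the uplink traffic side by side while reproducing the slow ls. A rough sketch of what I have in mind (assuming sysstat is installed; eth0 is only a guess at the uplink interface name on our MDS):

    # terminal 1 on the MDS: sample the request stats every 10 seconds
    llstat -i10 /proc/fs/lustre/mdt/MDS/mds/stats

    # terminal 2 on the MDS: watch per-interface traffic over the same interval
    sar -n DEV 10 | grep --line-buffered eth0

If req_waittime climbs while the uplink stays well below wire speed, that would point at something other than raw network bandwidth (MDS disk latency or lock contention, for example) despite the single uplink.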