Lex
2010-Feb-01 09:44 UTC
[Lustre-discuss] High difference in I/O network traffic in lustre client
Hi guys,

In an effort to improve our storage system performance, I found some strange signs that I could not explain by myself, so I am posting here in the hope that you can help me clarify them.

I'm using Lustre clients as web servers for file downloads. When our system is under heavy load (about 12,000 concurrent connections across 8 web servers, each a Lustre client), %iowait is pushed to about 98% and the load average reaches about 1,000-2,000! (This is purely because of %iowait; I could still run almost every command normally over ssh.) I think that is a terrible number for a load average. In that case, though, the inbound and outbound network traffic are almost the same (although only a few MB/s :( ).

The odd thing is that right now, with only about 3,500 concurrent connections, the load average is about 50 (still too high, right?), iowait is about 70%, and the difference between received and transmitted network traffic is too high, about 10-20 MB (please see the attached file).

We have only about 20 connections to our local Lustre storage system:

netstat -nat | grep 192.168.1.75
tcp        0    560 192.168.1.75:1023    192.168.1.85:988     ESTABLISHED
tcp        0      0 192.168.1.75:1023    192.168.1.81:988     ESTABLISHED
tcp        0      0 192.168.1.75:988     192.168.1.85:1023    ESTABLISHED
tcp        0      0 192.168.1.75:988     192.168.1.85:1022    ESTABLISHED
tcp        0      0 192.168.1.75:988     192.168.1.81:1023    ESTABLISHED
tcp        0      0 192.168.1.75:988     192.168.1.81:1022    ESTABLISHED
tcp        0      0 192.168.1.75:988     192.168.1.100:1023   ESTABLISHED
tcp        0      0 192.168.1.75:1021    192.168.1.78:988     ESTABLISHED
tcp        0      0 192.168.1.75:1023    192.168.1.78:988     ESTABLISHED
tcp        0      0 192.168.1.75:1022    192.168.1.78:988     ESTABLISHED
tcp        0    560 192.168.1.75:1023    192.168.1.100:988    ESTABLISHED

and about 400 connections with clients from the internet:

netstat -nat | grep out_wan_ip | grep EST | wc -l
407

We're currently using 2 Gigabit Ethernet cards, one on the 192.168.1.0/24 network for LNET and the other with a WAN IP for delivering files out to the internet, and about 15 MB/s of throughput is "lost" somehow!
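One way to put numbers on the receive/transmit gap per network card is to sample the kernel's byte counters twice and compare the two interfaces. The following is only a minimal sketch, assuming a Linux host that exposes /proc/net/dev; the function names and the 5-second interval are illustrative and not taken from the original post.

```python
import time


def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not a complete counter row
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates_mb_per_s(before, after, seconds):
    """Return {interface: (rx_MB_per_s, tx_MB_per_s)} between two samples."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name not in before:
            continue  # interface appeared between the two samples
        rx0, tx0 = before[name]
        rates[name] = ((rx1 - rx0) / seconds / 1e6,
                       (tx1 - tx0) / seconds / 1e6)
    return rates


def measure(interval=5.0, path="/proc/net/dev"):
    """Sample the counters twice and return the per-interface rates."""
    with open(path) as f:
        first = parse_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_net_dev(f.read())
    return rates_mb_per_s(first, second, interval)
```

Calling `measure()` on one of the web servers returns a receive and a transmit rate for each interface. If the LNET-side card receives noticeably more than the WAN-side card transmits, the client is probably reading more from the OSTs than it serves to users, and readahead would be one thing to check.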
So, my questions are:

- Does anyone have an idea or hint about the high load on our Lustre client web servers, as described above? I followed this link <http://rackerhacker.com/2008/03/11/hunting-down-elusive-sources-of-iowait/> and found that the kjournald process is the main "culprit" (on our OSTs, it was the "ll" process).
- What causes the large difference between the receive and transmit directions on our Lustre client web servers?

I'm really stressed by the poor performance of our storage system and hope someone here can help me point something out.

Any help would be highly appreciated.

Best regards

Attachment: iotraf.jpg (image/jpeg, 66692 bytes)
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100201/1aa231ee/attachment-0001.jpg
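On Linux, tasks in uninterruptible sleep (state "D", usually waiting on I/O) count toward the load average. That fits a load of several hundred on a machine that still responds normally over ssh. A rough way to see which processes are blocked at a given moment is to scan /proc for tasks in that state. This is a sketch only, assuming a Linux /proc filesystem; the helper names are hypothetical and this is not the exact procedure from the linked article.

```python
import os


def parse_stat(line):
    """Return (pid, command, state) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')'.
    left, right = line.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid), comm, state


def blocked_processes(proc_root="/proc"):
    """Return a sorted list of (pid, command) for tasks in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was malformed
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

Running `blocked_processes()` a few times in a row and noting which names keep reappearing (web server workers, kjournald, Lustre threads) gives a first hint of where the I/O wait accumulates.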
Mag Gam
2010-Feb-01 13:05 UTC
[Lustre-discuss] High difference in I/O network traffic in lustre client
How many OSSes and OSTs do you have? What type of hardware are they running on? What type of network connection? Which OSS holds the file you are trying to access? Are the files striped? What