Dear folks,

Last month I configured Lustre 1.8.5 over InfiniBand. Before that I was using Gluster 3.1.2; performance was OK but reliability was poor: when 40 or more applications requested to open a file at the same time, the Gluster servers randomly dropped the active client connections. Lustre does not have this problem, but I can see other issues. For example, namd shows "system cpu" around 30%, and the HPL benchmark shows 70%-80% "system cpu", which is far too high; with Gluster the system cpu never exceeded 5%. I think this is explained by Gluster using FUSE and running in user space, but I am not sure.

I have some doubts:

- Why does Lustre use IPoIB? With Gluster I did not use IPoIB, and I suspect the ipoib module hurts InfiniBand performance and interferes with the native InfiniBand module.
- Is it possible to configure Lustre to transport metadata over Ethernet and data over InfiniBand?
- For namd and the HPL benchmark, is it normal for system cpu to be so high?

My configuration is the following:

- QLogic 12800-180 switch, 7 leaves (24 ports per leaf) and 2 spines (all ports QDR, 40 Gbps)
- 66 Mellanox ConnectX HCAs, two ports, QDR 40 Gbps (compute nodes)
- 1 metadata server, 96 GB DDR3 RAM optimized for performance, two Xeon 5570, 15K RPM SAS disks in RAID 1, Mellanox ConnectX HCA with two ports
- 4 OSS, each with 1 OST of 2 TB in RAID 5 (8 TB in total). All OSSs have a Mellanox ConnectX HCA with two ports

I will appreciate any help or tips.

Regards

claudio

--
Claudio Baeza Retamal
CTO
National Laboratory for High Performance Computing (NLHPC)
Center for Mathematical Modeling (CMM)
School of Engineering and Sciences
Universidad de Chile
On 2011-01-14, at 3:57 PM, Claudio Baeza Retamal wrote:
> Last month I configured Lustre 1.8.5 over InfiniBand. Before that I was using
> Gluster 3.1.2; performance was OK but reliability was poor: when 40 or more
> applications requested to open a file at the same time, the Gluster servers
> randomly dropped the active client connections. Lustre does not have this
> problem, but I can see other issues. For example, namd shows "system cpu"
> around 30%, and the HPL benchmark shows 70%-80% "system cpu", which is far
> too high; with Gluster the system cpu never exceeded 5%. I think this is
> explained by Gluster using FUSE and running in user space, but I am not sure.

If Gluster is using FUSE, then all of the CPU usage would appear in "user" and not "system". That doesn't mean that the CPU usage is gone, just accounted in a different place.

> I have some doubts:
>
> - Why does Lustre use IPoIB? With Gluster I did not use IPoIB, and I suspect
>   the ipoib module hurts InfiniBand performance and interferes with the
>   native InfiniBand module.

If you are using IPoIB for data then your LNET is configured incorrectly. IPoIB is only needed for IB hostname resolution, and all LNET traffic can use native IB with very low CPU overhead. Your /etc/modprobe.conf and mount lines should be using {addr}@o2ib0 instead of {addr} or {addr}@tcp0.

> - Is it possible to configure Lustre to transport metadata over Ethernet and
>   data over InfiniBand?

Yes, this should be possible, but putting the metadata on IB is much lower latency and higher performance, so you should really try to use IB for both.

> - For namd and the HPL benchmark, is it normal for system cpu to be so high?
>
> My configuration is the following:
>
> - QLogic 12800-180 switch, 7 leaves (24 ports per leaf) and 2 spines (all
>   ports QDR, 40 Gbps)
> - 66 Mellanox ConnectX HCAs, two ports, QDR 40 Gbps (compute nodes)
> - 1 metadata server, 96 GB DDR3 RAM optimized for performance, two Xeon 5570,
>   15K RPM SAS disks in RAID 1, Mellanox ConnectX HCA with two ports
> - 4 OSS, each with 1 OST of 2 TB in RAID 5 (8 TB in total). All OSSs have a
>   Mellanox ConnectX HCA with two ports

If you have IB on the MDS then you should definitely use {addr}@o2ib0 for both OSS and MDS nodes. That will give you much better metadata performance.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
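For illustration, a minimal sketch of the native-IB setup described above; the interface name, NID address, filesystem name, and mount point are placeholders, not values taken from this thread:

    # /etc/modprobe.conf on every node with an IB port
    # ("o2ib0" is the default o2ib network, equivalent to plain "o2ib")
    options lnet networks="o2ib0(ib0)"

    # confirm the node's NID ends in @o2ib0 rather than @tcp0
    lctl list_nids

    # client mount using the MGS/MDS IPoIB address with the @o2ib0 NID,
    # so all LNET traffic goes over native IB
    mount -t lustre 10.0.0.1@o2ib0:/lustre /mnt/lustre

Note that the IPoIB address of the ib0 interface is still used for NID addressing, but the data itself does not flow over IPoIB once the o2ib network type is configured.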
On Saturday, January 15, 2011 12:57:18 am Claudio Baeza Retamal wrote:
> Last month I configured Lustre 1.8.5 over InfiniBand. Before that I was using
> Gluster 3.1.2; performance was OK but reliability was poor: when 40 or more
> applications requested to open a file at the same time, the Gluster servers
> randomly dropped the active client connections. Lustre does not have this
> problem, but I can see other issues. For example, namd shows "system cpu"
> around 30%, and the HPL benchmark shows 70%-80% "system cpu",

This is certainly an indication of something being very wrong (not necessarily Lustre). HPL does very minimal I/O, and only a completely broken filesystem would cause 70-80% "system cpu".

/Peter
Hi,

On 14-03-2011 22:05, Andreas Dilger wrote:
>> Why does Lustre use IPoIB? With Gluster I did not use IPoIB, and I suspect
>> the ipoib module hurts InfiniBand performance and interferes with the
>> native InfiniBand module.
> If you are using IPoIB for data then your LNET is configured incorrectly.
> IPoIB is only needed for IB hostname resolution, and all LNET traffic can use
> native IB with very low CPU overhead. Your /etc/modprobe.conf and mount lines
> should be using {addr}@o2ib0 instead of {addr} or {addr}@tcp0.

For the first two weeks I was using

    options lnet networks="o2ib(ib0)"

and now I am using

    options lnet networks="o2ib(ib0),tcp0(eth0)"

because I have one node without an HCA card. In both cases the system cpu usage is the same; the compute node without InfiniBand is only used to run Matlab.

In the HPL benchmark case, my doubt is: why is the system cpu usage so high? Is it possible that Lustre disturbs the mlx4 InfiniBand driver and causes problems with MPI? The HPL benchmark mainly does I/O to transport data over MPI. With GlusterFS system cpu was around 5%; since Lustre was configured, system cpu is 70%-80%, and we use o2ib(ib0) for LNET in modprobe.conf.

I have tried several options following instructions from Mellanox: on the compute nodes I disabled irqbalance and ran the smp_affinity script, but system cpu is still very high.

Are there any tools to study Lustre performance?

>> Is it possible to configure Lustre to transport metadata over Ethernet and
>> data over InfiniBand?
> Yes, this should be possible, but putting the metadata on IB is much lower
> latency and higher performance, so you should really try to use IB for both.
>
>> - 4 OSS, each with 1 OST of 2 TB in RAID 5 (8 TB in total). All OSSs have a
>>   Mellanox ConnectX HCA with two ports
> If you have IB on the MDS then you should definitely use {addr}@o2ib0 for
> both OSS and MDS nodes. That will give you much better metadata performance.
>
> Cheers, Andreas

regards

claudio
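On the question of tools, a few client-side counters are often a first stop; this is only a sketch, and exact parameter names can vary slightly between Lustre versions:

    # per-mount client statistics (read/write bytes, metadata operations)
    lctl get_param llite.*.stats

    # per-OST RPC size and concurrency histograms as seen from the client
    lctl get_param osc.*.rpc_stats

Comparing these during an HPL run against an idle period helps show whether the filesystem is actually doing any work while the high system CPU is observed.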
On 2011-01-15, at 4:18 AM, Claudio Baeza Retamal wrote:
> On 14-03-2011 22:05, Andreas Dilger wrote:
>> If you are using IPoIB for data then your LNET is configured incorrectly.
>> IPoIB is only needed for IB hostname resolution, and all LNET traffic can
>> use native IB with very low CPU overhead. Your /etc/modprobe.conf and mount
>> lines should be using {addr}@o2ib0 instead of {addr} or {addr}@tcp0.
>
> For the first two weeks I was using
>
>     options lnet networks="o2ib(ib0)"
>
> and now I am using
>
>     options lnet networks="o2ib(ib0),tcp0(eth0)"
>
> because I have one node without an HCA card. In both cases the system cpu
> usage is the same; the compute node without InfiniBand is only used to run
> Matlab.
>
> In the HPL benchmark case, my doubt is: why is the system cpu usage so high?
> Is it possible that Lustre disturbs the mlx4 InfiniBand driver and causes
> problems with MPI? The HPL benchmark mainly does I/O to transport data over
> MPI. With GlusterFS system cpu was around 5%; since Lustre was configured,
> system cpu is 70%-80%, and we use o2ib(ib0) for LNET in modprobe.conf.

Have you tried disabling the Lustre kernel debug logs (lctl set_param debug=0) and/or disabling the network data checksums (lctl set_param osc.*.checksums=0)?

Note that there is also CPU overhead in the kernel from copying data from userspace to the kernel that is unavoidable for any filesystem, unless O_DIRECT is used (which causes synchronous IO and has IO alignment restrictions).

> I have tried several options following instructions from Mellanox: on the
> compute nodes I disabled irqbalance and ran the smp_affinity script, but
> system cpu is still very high.
> Are there any tools to study Lustre performance?
>
> regards
>
> claudio

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
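As a concrete sketch of the two knobs mentioned above, run on the clients (the get_param lines are only there to record the current values before changing anything):

    # check current settings
    lctl get_param debug
    lctl get_param osc.*.checksums

    # disable kernel debug logging and client-side data checksums
    lctl set_param debug=0
    lctl set_param osc.*.checksums=0

Both changes made this way are runtime-only and revert after a remount or reboot, so they are cheap to toggle just to see whether the system-CPU numbers during HPL move at all.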
Hi Claudio,

When you say that during Linpack you see high system cpu usage, do you mean the cpu usage on the clients or on the servers? Can you run, for example, the top command and see which processes take most of the CPU time?

Cheers

Wojciech

On 15 January 2011 11:18, Claudio Baeza Retamal <claudio at dim.uchile.cl> wrote:
> Hi,
>
> For the first two weeks I was using
>
>     options lnet networks="o2ib(ib0)"
>
> and now I am using
>
>     options lnet networks="o2ib(ib0),tcp0(eth0)"
>
> because I have one node without an HCA card. In both cases the system cpu
> usage is the same; the compute node without InfiniBand is only used to run
> Matlab.
>
> In the HPL benchmark case, my doubt is: why is the system cpu usage so high?
> Is it possible that Lustre disturbs the mlx4 InfiniBand driver and causes
> problems with MPI? The HPL benchmark mainly does I/O to transport data over
> MPI. With GlusterFS system cpu was around 5%; since Lustre was configured,
> system cpu is 70%-80%, and we use o2ib(ib0) for LNET in modprobe.conf.
>
> I have tried several options following instructions from Mellanox: on the
> compute nodes I disabled irqbalance and ran the smp_affinity script, but
> system cpu is still very high.
> Are there any tools to study Lustre performance?
>
> regards
>
> claudio
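A sketch of the kind of quick check being asked for here, run on a compute node during an HPL run (and on the OSS/MDS for comparison); the thread names in the comments are examples of what to look for, not output from this cluster:

    # user vs. system CPU breakdown, sampled every 2 seconds, 5 samples
    mpstat 2 5

    # busiest processes: Lustre kernel threads (e.g. ptlrpcd on clients,
    # ll_ost_* on the OSS) near the top would implicate the filesystem,
    # while high %sy charged to the MPI ranks themselves points more at
    # the interconnect stack
    top -b -n 1 | head -n 30
    ps aux --sort=-%cpu | head -n 15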