I'm attempting to deploy a new lustre filesystem using lustre 1.8.5, but
this is my first stab at incorporating an IB network. I've deployed
several over tcp using 1.8.4 without issue, so I'm not sure if there is
an IB configuration issue or a 1.8.5 issue here. Any assistance would be
appreciated.

This new cluster has two parallel networks:
  gige: 10.27.5.0/23
  ib  : 10.27.8.0/23

On the lfs servers and clients, lnet is configured as:

  options lnet networks=o2ib0(ib0),tcp0(ib0)

The IB network is routable to 10/8, and clients mount other lustre
filesystems using 1.8.4 over tcp.

On the MDS (with an intended failover to a secondary) the mgs,mdt
filesystem is created with:

  mkfs.lustre --fsname lfs --mdt --mgs \
    --mkfsoptions='-i 1024 -I 512' \
    --failnode=10.27.9.133@o2ib0 --failnode=10.27.9.132@o2ib0 \
    --mountfsoptions=iopen_nopriv,user_xattr,errors=remount-ro,acl \
    /dev/sda

However, this mount then fails with:

  mount.lustre: mount /dev/sda at /data/mds failed: Cannot assign
  requested address

lctl list_nids shows the proper NIDs:

  10.27.9.133@o2ib
  10.27.9.133@tcp

Dmesg shows a parsing error with the o2ib0 NID:

  LustreError: 159-d: Can't parse NID 'failover.node=10.27.9.133@o2ib0'
  Lustre: Denying initial registration attempt from nid 10.27.9.133@o2ib,
  specified as failover
  LustreError: 9571:0:(obd_mount.c:1097:server_start_targets()) Required
  registration failed for lfs-MDT0000: -99

Am I specifying the failover incorrectly? What should it be when using
o2ib as the primary interconnect? If I remove the failover parameters
using tunefs.lustre, the mount succeeds, but clients cannot connect to
the MDT.

--
Gary Molenkamp                  SHARCNET
Systems Administrator           University of Western Ontario
Compute/Calcul Canada           http://www.computecanada.org
gary at sharcnet.ca             http://www.sharcnet.ca
(519) 661-2111 x88429           (519) 661-4000
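The dmesg line suggests the whole "failover.node=..." string, parameter
name included, is being handed to the NID parser. One quick way to see
exactly what was written to the target is tunefs.lustre's dry-run mode,
which prints the stored parameters without changing anything; a minimal
example (device name as in the post above):

  # read back the parameters recorded on the MDT (no changes are made)
  tunefs.lustre --dryrun /dev/sda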
On 12/13/2010 11:54 AM, Gary Molenkamp wrote:
> I'm attempting to deploy a new lustre filesystem using lustre 1.8.5, but
> this is my first stab at incorporating an IB network.
[snip]
> This new cluster has two parallel networks:
>   gige: 10.27.5.0/23
>   ib  : 10.27.8.0/23
>
> On the lfs servers and clients, lnet is configured as:
> options lnet networks=o2ib0(ib0),tcp0(ib0)
                        ^^^^^
Why are you assigning two different network types to the same physical
device?
[snip]
Colin Faber wrote:
> On 12/13/2010 11:54 AM, Gary Molenkamp wrote:
[snip]
>> On the lfs servers and clients, lnet is configured as:
>> options lnet networks=o2ib0(ib0),tcp0(ib0)
>                        ^^^^^
> Why are you assigning two different network types to the same physical
> device?

My assumption was that this indicated to lnet when IPoIB was to be used
vs native IB, but by your question, I assume that is not the case.  :)

I retested with just

  options lnet networks=o2ib0(ib0)

and the resulting error conditions below still hold true.

>> However, this mount then fails with:
>>
>> mount.lustre: mount /dev/sda at /data/mds failed: Cannot assign
>> requested address
>>
>> Dmesg shows a parsing error with the o2ib0 NID:
>>
>> LustreError: 159-d: Can't parse NID 'failover.node=10.27.9.133@o2ib0'
>> Lustre: Denying initial registration attempt from nid 10.27.9.133@o2ib,
>> specified as failover
>> LustreError: 9571:0:(obd_mount.c:1097:server_start_targets()) Required
>> registration failed for lfs-MDT0000: -99
>>
>> Am I specifying the failover incorrectly? What should it be when using
>> o2ib as the primary interconnect? If I remove the failover parameters
>> using tunefs.lustre, the mount succeeds, but clients cannot connect to
>> the MDT.

--
Gary Molenkamp                  SHARCNET
Systems Administrator           University of Western Ontario
Compute/Calcul Canada           http://www.computecanada.org
gary at sharcnet.ca             http://www.sharcnet.ca
(519) 661-2111 x88429           (519) 661-4000
On 14/12/2010 05:54, Gary Molenkamp wrote:
> On the MDS (with an intended failover to a secondary) the mgs,mdt
> filesystem is created with:
>
> mkfs.lustre --fsname lfs --mdt --mgs \
>   --mkfsoptions='-i 1024 -I 512' \
>   --failnode=10.27.9.133@o2ib0 --failnode=10.27.9.132@o2ib0 \
>   --mountfsoptions=iopen_nopriv,user_xattr,errors=remount-ro,acl \
>   /dev/sda
>
> However, this mount then fails with:
>
> mount.lustre: mount /dev/sda at /data/mds failed: Cannot assign
> requested address

Shouldn't there only be one "--failnode" flag? IIRC, failnode should
only reference the secondary / standby server, not the primary (i.e. the
node where the mkfs command is being executed).

Malcolm.
On Mon, 13 Dec 2010, Colin Faber wrote:
> On 12/13/2010 11:54 AM, Gary Molenkamp wrote:
[snip]
>> On the lfs servers and clients, lnet is configured as:
>> options lnet networks=o2ib0(ib0),tcp0(ib0)
>                        ^^^^^
> Why are you assigning two different network types to the same physical
> device?

Hello Colin,

Thanks for the reply. In answer to your question:

The same physical device has access to two different lustre filesystems
using different protocols.

One lustre filesystem is locally available via the native IB interface,
o2ib0(ib0).

The other lustre filesystem is remotely available (via an IB to 10Gb
switch/gateway in the local IB fabric) on the same local IB device, but
only via the tcp/ip (IPoIB) protocol, tcp0(ib0).

(not sure how good this ASCII diagram will look)

              ----------------------
              | local lustre setup |
              ----------------------
                      | ib0
 --------       -------------
 |client|-------| ib fabric |
 --------       -------------
                      |
                -----------------
                | ib to 10Gb gw |
                -----------------
                      | eth0
              -----------------------
              | remote lustre setup |
              -----------------------

Is this possible?

-k

[snip]
Thanx Malcolm (and Colin) for the pointers. I have the local filesystem
up and running over the o2ib(ib0) network, using a single failnode and
only the o2ib(ib0) network. As Kaizaad mentioned though, we are in a
situation where we need access to tcp over the ib device, or abandon
o2ib and run the local lfs as ipoib as well.

Malcolm Cowe wrote:
> On 14/12/2010 05:54, Gary Molenkamp wrote:
[snip]
> Shouldn't there only be one "--failnode" flag? IIRC, failnode should
> only reference the secondary / standby server, not the primary (i.e. the
> node where the mkfs command is being executed).
>
> Malcolm.

--
Gary Molenkamp                  SHARCNET
Systems Administrator           University of Western Ontario
Compute/Calcul Canada           http://www.computecanada.org
gary at sharcnet.ca             http://www.sharcnet.ca
(519) 661-2111 x88429           (519) 661-4000
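For reference, a minimal sketch of the single-failnode invocation implied
by the exchange above. The device, fsname, and mkfs/mount options are
copied from the original post; this is only an illustration of Malcolm's
suggestion as Gary applied it, not a verified recipe:

  # Run on the primary MDS (10.27.9.133); --failnode names only the
  # standby MDS (10.27.9.132).
  mkfs.lustre --fsname lfs --mdt --mgs \
    --mkfsoptions='-i 1024 -I 512' \
    --failnode=10.27.9.132@o2ib0 \
    --mountfsoptions=iopen_nopriv,user_xattr,errors=remount-ro,acl \
    /dev/sda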
Heald, Nathan T.
2010-Dec-14 17:20 UTC
[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)
Hi everyone,

I have been running sgpdd-survey on some DDN 9550s and am getting some
errors. I'm using what I believe to be the latest version of the I/O Kit
(lustre-iokit-1.2-200709210921). I've got 4 OSSes attached and run
sgpdd-survey against all the disks from each host, one host at a time.
Each host is getting these errors, but not identically. I've found
several threads on the mailing list with people reporting this same
error, but there are no resolutions posted. One post suggested a
modification to the flags for "sg_readcap" in the script could resolve
these errors, but making the changes did not seem to fix the issue. It
looks like sgp_dd is having intermittent problems:

16384+0 records out
sg starting in command at "sgp_dd.c":827: Cannot allocate memory
sg starting in command at "sgp_dd.c":827: Cannot allocate memory
sg starting in command at "sgp_dd.c":827: Cannot allocate memory
sg starting in command at "sgp_dd.c":827: Cannot allocate memory
sg starting in command at "sgp_dd.c":827: Cannot allocate memory
sg starting in command at "sgp_dd.c":827: Cannot allocate memory

Output from sgpdd-survey:

Wed Dec 1 10:55:55 EST 2010 sgpdd-survey on /dev/sdp /dev/sdo /dev/sdn
/dev/sdw /dev/sdv /dev/sdu /dev/sdt /dev/sds /dev/sdy /dev/sdr /dev/sdx
/dev/sdq from oss1
...
total_size 100663296K rsz 1024 crg 384 thr   768 write 388.20 MB/s 384 x 1.01 = 388.18 MB/s  read 387.16 MB/s 384 x 1.01 = 388.18 MB/s
total_size 100663296K rsz 1024 crg 384 thr  1536 write 1 failed    read 385.72 MB/s 384 x 1.01 = 388.18 MB/s
total_size 100663296K rsz 1024 crg 384 thr  3072 write 140 failed  read 121 failed
total_size 100663296K rsz 1024 crg 384 thr  6144 ENOMEM
total_size 100663296K rsz 1024 crg 768 thr   768 write 1 failed    read 387.28 MB/s 768 x 0.51 = 388.18 MB/s
total_size 100663296K rsz 1024 crg 768 thr  1536 write 388.23 MB/s 768 x 0.51 = 388.18 MB/s  read 386.76 MB/s 768 x 0.51 = 388.18 MB/s
total_size 100663296K rsz 1024 crg 768 thr  3072 write 42 failed   read 31 failed
total_size 100663296K rsz 1024 crg 768 thr  6144 ENOMEM
total_size 100663296K rsz 1024 crg 768 thr 12288 ENOMEM
...

Any suggestions are welcome.

Thanks,
-Nathan
Kevin Van Maren
2010-Dec-14 17:38 UTC
[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)
Yep, this is a common problem. I've never bothered to figure out why
memory can't be allocated, although as you note the issue is in sgp_dd,
not in the iokit scripts. It could be a resource limit of some sort
(pinned pages?). If you have time to dig into it, I'm sure many people
would appreciate it.

One thing to note is that Lustre limits itself to 512 total threads per
server, so there are never more than that many outstanding IOs when
running Lustre, although additional client requests can be queued and
processed, which is why higher crg/thread values are interesting. If you
limit the sgpdd-survey total thread count, you should not have these
failures (note that 1536 threads has one failing write process while
3072 has 140; perhaps you could have sgp_dd retry the allocation).

Kevin

Heald, Nathan T. wrote:
> Hi everyone,
> I have been running sgpdd-survey on some DDN 9550s and am getting some
> errors.
[snip]
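For what it's worth, a rough sketch of capping the survey so the total
thread count stays at or below that 512-thread ceiling. The tunables
shown (scsidevs, crghi, thrhi) are the environment variables I recall
from the top of the sgpdd-survey script; check the header of the copy in
your iokit before relying on the exact names:

  # hypothetical invocation -- caps total regions at 256 and total
  # threads at 512 across the listed devices
  crghi=256 thrhi=512 \
  scsidevs="/dev/sdp /dev/sdo /dev/sdn /dev/sdq" \
  ./sgpdd-survey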
Jim Shankland
2010-Dec-14 18:20 UTC
[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)
Heald, Nathan T. wrote:
> Hi everyone,
> I have been running sgpdd-survey on some DDN 9550s and am getting some
> errors.
[snip]
> 16384+0 records out
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory
[snip]
> Output from sgpdd-survey:
>
> Wed Dec 1 10:55:55 EST 2010 sgpdd-survey on /dev/sdp /dev/sdo /dev/sdn
> /dev/sdw /dev/sdv /dev/sdu /dev/sdt /dev/sds /dev/sdy /dev/sdr /dev/sdx
> /dev/sdq from oss1
> ...
> total_size 100663296K rsz 1024 crg 384 thr  768 write 388.20 MB/s 384 x 1.01 = 388.18 MB/s  read 387.16 MB/s 384 x 1.01 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 384 thr 1536 write 1 failed    read 385.72 MB/s 384 x 1.01 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 384 thr 3072 write 140 failed  read 121 failed
> total_size 100663296K rsz 1024 crg 384 thr 6144 ENOMEM

You just don't have enough RAM to do these particular runs. If you look
at the line ending in ENOMEM above: sgpdd-survey is proposing to launch
384 separate sgp_dd processes for each of 12 different devices, with
each process launching 16 threads (6144 / 384), and each thread
allocating at least one 1 MiB write buffer. That adds up to 72 GiB of
RAM for write buffers. The ENOMEM line means that the sgpdd-survey
script looked at the amount of physical RAM you have and estimated it
wasn't enough to do this run.

You could try running sgpdd-survey against each block device one at a
time, which will reduce the needed RAM by a factor of 12 (in your case),
but of course isn't quite equivalent.

sg_readcap is used to determine the physical sector size and capacity
(sector count) of each block device. I wouldn't think changing the flags
on it would help anything.

Jim Shankland
Whamcloud, Inc.
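To make the arithmetic above concrete, a quick back-of-the-envelope
check in plain shell (the 384 processes, 12 devices, 16 threads, and
1 MiB per buffer all come from Jim's description):

  # 384 processes/device * 12 devices * 16 threads/process * 1 MiB/thread
  echo $(( 384 * 12 * 16 ))         # 73728 MiB of write buffers
  echo $(( 384 * 12 * 16 / 1024 ))  # 72 GiB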
Kevin Van Maren
2010-Dec-14 20:27 UTC
[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)
Jim Shankland wrote:
>> ...
>> total_size 100663296K rsz 1024 crg 384 thr  768 write 388.20 MB/s 384 x 1.01 = 388.18 MB/s  read 387.16 MB/s 384 x 1.01 = 388.18 MB/s
>> total_size 100663296K rsz 1024 crg 384 thr 1536 write 1 failed    read 385.72 MB/s 384 x 1.01 = 388.18 MB/s
>> total_size 100663296K rsz 1024 crg 384 thr 3072 write 140 failed  read 121 failed
>> total_size 100663296K rsz 1024 crg 384 thr 6144 ENOMEM
>
> You just don't have enough RAM to do these particular runs. If you look
> at the line ending in ENOMEM above: sgpdd-survey is proposing to launch
> 384 separate sgp_dd processes for each of 12 different devices, with
> each process launching 16 threads (6144 / 384), and each thread
> allocating at least one 1 MiB write buffer. That adds up to 72 GiB of
> RAM for write buffers. The ENOMEM line means that the sgpdd-survey
> script looked at the amount of physical RAM you have and estimated it
> wasn't enough to do this run.

It's not just the ENOMEM at 6144 total threads that is the problem, it
is the "write X failed", etc., at the _lower_ thread counts.

From memory, the "crg" and "thr" numbers are already multiplied by 12
(the number of devices being tested), so "thr" should reflect the total
number of buffers required. For this test, it looks like crg=32 and
SG_MAX_QUEUE is the default 16. So the memory consumption _should not_
be an issue, but sgp_dd is still having problems allocating buffers.

Again, I've seen this even when I clearly had free memory on the node,
so I think there is something else at work here.

Kevin
>> Why are you assigning two different network types to the same physical
>> device?
>
> Hello Colin,
>
> Thanks for the reply. In answer to your question:
>
> The same physical device has access to two different lustre filesystems
> using different protocols.
>
> One lustre filesystem is locally available via the native IB interface,
> o2ib0(ib0).
>
> The other lustre filesystem is remotely available (via an IB to 10Gb
> switch/gateway in the local IB fabric) on the same local IB device, but
> only via the tcp/ip (IPoIB) protocol, tcp0(ib0).
>
> [ASCII diagram snipped]
>
> Is this possible?
>
> -k

I did manage to get this to work properly under the following conditions:

  remote lustre setup uses   tcp(eth0)
  local lustre setup uses    o2ib(ib0)
  on the ib client, lnet is  o2ib(ib0),tcp(ib0)

With this configuration, all lustre servers are active and reachable. If
the client ordering is reversed, then the OSSs on the local lustre
always report as temporarily unreachable.

--
Gary Molenkamp                  SHARCNET
Systems Administrator           University of Western Ontario
Compute/Calcul Canada           http://www.computecanada.org
gary at sharcnet.ca             http://www.sharcnet.ca
(519) 661-2111 x88429           (519) 661-4000
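For illustration, a minimal sketch of the lnet module options implied by
the working ordering above. The interface names come from the thread;
the exact network numbering and config-file location are assumptions and
should be adapted to the actual nodes:

  # remote lustre servers (reached through the IB-to-10GbE gateway)
  options lnet networks=tcp0(eth0)

  # local lustre servers (native IB)
  options lnet networks=o2ib0(ib0)

  # IB clients: o2ib listed first, then tcp, both on the ib0 interface
  options lnet networks=o2ib0(ib0),tcp0(ib0)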
Christopher J. Walker
2011-Feb-21 23:52 UTC
[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)
On 14/12/10 20:27, Kevin Van Maren wrote:
[snip]
> It's not just the ENOMEM at 6144 total threads that is the problem, it
> is the "write X failed", etc., at the _lower_ thread counts.
>
> Again, I've seen this even when I clearly had free memory on the node,
> so I think there is something else at work here.

I've run into this problem (on a Scientific Linux 5.5 machine).

If I use /dev/sg1, I get the following:

[root@sn86 lustre]# sgp_dd if=/dev/zero of=/dev/sg1 seek=1024 thr=1 count=1677721 bs=512 bpt=2048 time=1
sg starting out command at "sgp_dd.c":872: Cannot allocate memory

whereas if I use /dev/sdb, I get:

[root@sn86 lustre]# sgp_dd if=/dev/zero of=/dev/sdb seek=1024 thr=1 count=1677721 bs=512 bpt=2048 time=1
time to transfer data was 0.485030 secs, 1771.01 MB/sec

They correspond to the same disk:

[root@sn86 lustre]# sg_map | grep sdb
/dev/sg1  /dev/sdb

Have I just defeated the point of using sgp_dd? Is the fact that this is
really a SATA disk (behind a Dell H700 controller) the problem?

Chris
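One avenue that may be worth checking (speculative; nothing in the
thread confirms it is the cause here): with bs=512 and bpt=2048 each sg
command needs a 1 MiB transfer buffer from the sg driver, and the sg
driver's default reserved buffer is much smaller, so the /dev/sg1 path
allocates buffers differently from the /dev/sdb block-device path. If
your kernel exposes the sg module's def_reserved_size parameter,
something like the following could be tried before re-running the
/dev/sg1 test (the sysfs path and its writability vary by kernel, so
treat this as a sketch):

  # transfer size per sg command with bs=512 bpt=2048
  echo $(( 512 * 2048 ))                         # 1048576 bytes (1 MiB)

  # inspect the sg driver's default reserved buffer size (commonly 32768)
  cat /sys/module/sg/parameters/def_reserved_size

  # try raising it to 1 MiB, then re-run sgp_dd against /dev/sg1
  echo 1048576 > /sys/module/sg/parameters/def_reserved_size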