Has anyone successfully built/run lustre on an SGI Origin?  I have one that I've been using as a test mule for various things.  I tried building up lustre on it, and while it said it all compiled, when I try to mount a test fs from an opteron box, it fails to mount.

The client (origin) just hangs for a while and then complains of an IO error while attempting to mount.  Looking at the server, I see errors in the log such as:

Oct 6 15:27:25 localhost LustreError: 8078:0:(pack_generic.c:743:lustre_unpack_msg_v2()) message length 168 too small for 50331648 buflens

My surmise is that maybe I have an endian issue (the origin is a big-ender mips) but I had believed that the endian issues were supposed to be under control in the lustre communication stuff.

I haven't debugged this very hard.  Anybody have suggestions on what to look for?

On Oct 06, 2006 15:58 -0400, John R. Dunning wrote:
> Has anyone successfully built/run lustre on an SGI Origin?  I have one that
> I've been using as a test mule for various things.  I tried building up lustre
> on it, and while it said it all compiled, when I try to mount a test fs from
> an opteron box, it fails to mount.

We have tested Lustre on an Origin at one time, maybe a year ago, without problems.  We currently run Lustre 1.4 in production with mixed-endian systems (PPC clients, x86_64 servers) and mixed-wordsize (ia64 clients, i686 servers).

> The client (origin) just hangs for a while and then complains of an IO error
> while attempting to mount.  Looking at the server, I see errors in the log
> such as:
>
> Oct 6 15:27:25 localhost LustreError: 8078:0:(pack_generic.c:743:lustre_unpack_msg_v2()) message length 168 too small for 50331648 buflens
>
> My surmise is that maybe I have an endian issue (the origin is a big-ender
> mips) but I had believed that the endian issues were supposed to be under
> control in the lustre communication stuff.

This is with Lustre 1.5.95?  It does indeed look like an endian problem: 50331648 = 0x3000000, so this field is not being swabbed correctly - it should be 3.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
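
The arithmetic behind that diagnosis can be checked with a few lines of Python. This is only an illustrative sketch, not Lustre code: `swab32` is a name made up here for a generic 32-bit byte swap, and it shows that the value in the error message is what a count of 3 looks like when read in the wrong byte order.

```python
import struct


def swab32(value):
    """Reverse the byte order of a 32-bit unsigned integer (illustrative helper)."""
    # Pack as big-endian, then reinterpret the same four bytes as little-endian.
    return struct.unpack("<I", struct.pack(">I", value))[0]


bad_buflens = 50331648  # value reported in the server log

print(hex(bad_buflens))     # 0x3000000
print(swab32(bad_buflens))  # 3
```

The swap is its own inverse, so applying it twice returns the original value; a field that is swabbed zero times or twice on a mixed-endian link shows up garbled in exactly this way.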
From: Andreas Dilger <adilger@clusterfs.com>
    Date: Wed, 11 Oct 2006 10:45:29 -0600
    
    
    We have tested Lustre on an Origin at one time, maybe a year ago, without
    problems.  
That's good to know.
    We currently run Lustre 1.4 in production with mixed-endian
    systems (PPC clients, x86_64 servers) and mixed-wordsize (ia64 clients,
    i686 servers).
Right, that's what I'd thought.  I don't have access to any big-enders
other than this origin box, so I don't have any other data points to
work with, but I was pretty sure I remembered that mixed-endian had
been demonstrated to work.
    
    This is with Lustre 1.5.95?  
1.5.91.  I haven't gotten around to updating to the latest yet.

    It does indeed look like an endian problem
    50331648 = 0x3000000, so this field is not being swabbed correctly - it
    should be 3.
    
Yes, that's what it looked like to me as well.
Any hints on how to diagnose this further?  Is there something I can
turn on that will illustrate what/when is being swapped by the
incoming message code?

On Oct 11, 2006 12:50 -0400, John R. Dunning wrote:
> It does indeed look like an endian problem
> 50331648 = 0x3000000, so this field is not being swabbed correctly - it
> should be 3.
>
> Yes, that's what it looked like to me as well.
>
> Any hints on how to diagnose this further?  Is there something I can
> turn on that will illustrate what/when is being swapped by the
> incoming message code?

Load the libcfs and lnet modules, then "sysctl -w lnet.debug=-1" and try to mount the client.  After the failure, "lctl dk | sort -k4 -t: > /tmp/debug" to dump the debug logs and search that file for your error message.  It will show the full callpath taken until the error message is hit.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
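
For the "search that file for your error message" step, something like the following might help. It is a rough sketch that assumes nothing about the debug log format beyond one entry per line: `context_before` is a hypothetical helper, and the search string and the 40-line window are arbitrary choices. It returns each matching line together with the lines that led up to it in the sorted dump.

```python
import os
from collections import deque


def context_before(lines, needle, before=20):
    """Return one list per match: up to `before` preceding lines plus the matching line."""
    recent = deque(maxlen=before)  # sliding window of the most recent lines
    hits = []
    for line in lines:
        if needle in line:
            hits.append(list(recent) + [line])
        recent.append(line)
    return hits


if __name__ == "__main__":
    path = "/tmp/debug"  # the file written by the "lctl dk | sort" command above
    if os.path.exists(path):
        with open(path, errors="replace") as f:
            for hit in context_before(f, "too small for", before=40):
                print("".join(hit))
                print("-" * 60)
```

Because the dump is sorted on the fourth colon-separated field, the lines just before the match should be the calls made immediately before the error was raised, though entries from other threads may be interleaved with them.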