Has anyone successfully built/run Lustre on an SGI Origin? I have one that I've been using as a test mule for various things. I tried building Lustre on it, and while the build reported that everything compiled, mounting a test fs served from an Opteron box fails.

The client (Origin) just hangs for a while and then complains of an IO error while attempting to mount. Looking at the server, I see errors in the log such as:

Oct 6 15:27:25 localhost LustreError: 8078:0:(pack_generic.c:743:lustre_unpack_msg_v2()) message length 168 too small for 50331648 buflens

My surmise is that maybe I have an endian issue (the Origin is a big-endian MIPS), but I had believed that the endian issues were supposed to be under control in the Lustre communication stuff.

I haven't debugged this very hard. Anybody have suggestions on what to look for?
On Oct 06, 2006 15:58 -0400, John R. Dunning wrote:
> Has anyone successfully built/run lustre on an SGI Origin? I have one that
> I've been using as a test mule for various things. I tried building up lustre
> on it, and while it said it all compiled, when I try to mount a test fs from
> an opteron box, it fails to mount.

We have tested Lustre on an Origin at one time, maybe a year ago, without problems. We currently run Lustre 1.4 in production with mixed-endian systems (PPC clients, x86_64 servers) and mixed-wordsize (ia64 clients, i686 servers).

> The client (origin) just hangs for a while and then complains of an IO error
> while attempting to mount. Looking at the server, I see errors in the log
> such as:
>
> Oct 6 15:27:25 localhost LustreError: 8078:0:(pack_generic.c:743:lustre_unpack_msg_v2()) message length 168 too small for 50331648 buflens
>
> My surmise is that maybe I have an endian issue (the origin is a big-ender
> mips) but I had believed that the endian issues were supposed to be under
> control in the lustre communication stuff.

This is with Lustre 1.5.95? It does indeed look like an endian problem: 50331648 = 0x3000000, so this field is not being swabbed correctly. It should be 3.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
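The arithmetic behind that diagnosis can be checked with a few lines of Python. This is only a sketch of the byte-order mix-up, not Lustre code: `swab32` is a hypothetical helper name, and the assumption that the value 3 was written by the big-endian client and read unswabbed by the little-endian server is inferred from the thread, not confirmed by it.

```python
import struct

def swab32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer (hypothetical helper)."""
    return struct.unpack("<I", struct.pack(">I", value))[0]

# Value reported in the server log.
observed = 50331648
print(hex(observed))     # hexadecimal form of the logged value
print(swab32(observed))  # the same four bytes in the opposite byte order

# Assumed scenario: a big-endian sender writes 3, and a little-endian
# receiver reads the four bytes without swabbing them.
wire = struct.pack(">I", 3)
misread = struct.unpack("<I", wire)[0]
print(misread)
```

If the assumption holds, `misread` equals the number in the log, which is consistent with a count of 3 arriving unswabbed.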
    From: Andreas Dilger <adilger@clusterfs.com>
    Date: Wed, 11 Oct 2006 10:45:29 -0600

    We have tested Lustre on an Origin at one time, maybe a year ago,
    without problems.

That's good to know.

    We currently run Lustre 1.4 in production with mixed-endian systems
    (PPC clients, x86_64 servers) and mixed-wordsize (ia64 clients, i686
    servers).

Right, that's what I'd thought. I don't have access to any big-endian machines other than this Origin box, so I don't have any other data points to work with, but I was pretty sure I remembered that mixed-endian had been demonstrated to work.

    This is with Lustre 1.5.95?

1.5.91. I haven't gotten around to updating to the latest yet.

    It does indeed look like an endian problem
    50331648 = 0x3000000, so this field is not being swabbed correctly - it
    should be 3.

Yes, that's what it looked like to me as well.

Any hints on how to diagnose this further? Is there something I can turn on that will illustrate what/when is being swapped by the incoming message code?
On Oct 11, 2006 12:50 -0400, John R. Dunning wrote:
> It does indeed look like an endian problem
> 50331648 = 0x3000000, so this field is not being swabbed correctly - it
> should be 3.
>
> Yes, that's what it looked like to me as well.
>
> Any hints on how to diagnose this further? Is there something I can
> turn on that will illustrate what/when is being swapped by the
> incoming message code?

Load the libcfs and lnet modules, then run "sysctl -w lnet.debug=-1" and try to mount the client. After the failure, run "lctl dk | sort -k4 -t: > /tmp/debug" to dump the debug logs, and search that file for your error message. It will show the full call path taken until the error message is hit.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.