I ran into an interesting/strange issue today. I still don't understand
what happened, but I know what fixed it.

I had a situation where I could only see vblade exported devices from OFF
the physical machine. It seemed that if the packets went through two ports
of the bridge (instead of one port and the real interface) they got "lost".

I sniffed around and failed to figure out what was going on. As soon as
vblade fired up I started seeing the infamous "peth0: received packet with
own address as source address" messages on dom0. I chased that for a bit,
but didn't get anywhere.

So I read the ATA spec and looked at the vblade code. I could see that the
vblade server was getting some packets, even if they did have the bridge
MAC as the source, but it was not responding to them. They looked valid
from tcpdump, so I started adding debug statements to the vblade server.

It turns out that for some reason the packets were shorter than vblade
expected, and it was ignoring them. I changed the packet length check
from "if (n < 60)" to "if (n < 32)", and voila, it works. (in aoe.c)

        for (;;) {
                n = getpkt(sfd, buf, bufsz);
                if (n < 0) {
                        perror("read network");
                        exit(1);
                }
        //      if (n < 60) {
                if (n < 32) {
        //              fprintf(stderr,"skipping short read (%d<36)\n",n);
                        continue;
                }

I've got two identical systems, and why a given dom0 could only see the
vblade server in a domU on the other physical machine is beyond me.

I'm not a Linux ethernet bridging expert, nor do I know why that 60 byte
check was in the code... but I was certainly getting shorter packets, e.g.

21:58:48.408750 fe:ff:ff:ff:ff:ff > 00:16:3e:23:f7:0b, ethertype Unknown (0x88a2), length 36:
        0x0000:  1000 0002 0100 0957 db28 0000 01ec 0000  .......W.(......
        0x0010:  00a0 0000 0000

I've beaten on it fairly hard since, and vblade on top of a drbd
"partition" seems to be working well.

If it helps, this is a pair of x86_64 systems, xen-3.0.3-0, one a
Pentium D and the other a dual AMD 2216.
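One plausible reading of the 60: Ethernet's minimum frame is 64 bytes
including the 4-byte FCS, so a frame that has crossed a physical wire has
been padded to at least 60 bytes by the time it is read. The 36-byte
capture above is exactly a 14-byte Ethernet header plus a 10-byte AoE
header plus a 12-byte ATA header, which suggests that frames passing only
through the software bridge are never padded. A minimal sketch of a
per-command length check along those lines follows. The names
(aoe_min_len, aoe_len_ok and the constants) are made up for illustration
and are not part of vblade, and the header sizes are assumptions taken
from the AoE frame layout.

```c
#include <assert.h>
#include <stddef.h>

/* Header sizes in bytes (assumed from the AoE frame layout). */
enum {
        ETH_HDR_LEN  = 14,  /* dst MAC + src MAC + ethertype */
        AOE_HDR_LEN  = 10,  /* ver/flags, error, major, minor, command, tag */
        ATA_HDR_LEN  = 12,  /* ATA command header (AoE command 0) */
        CFG_HDR_LEN  = 8,   /* query-config header (AoE command 1) */
        ETH_MIN_WIRE = 60   /* shortest frame off a physical wire, without FCS */
};

enum { AOE_CMD_ATA = 0, AOE_CMD_CONFIG = 1 };

/* Hypothetical helper: smallest unpadded frame for a given AoE command. */
size_t
aoe_min_len(unsigned cmd)
{
        switch (cmd) {
        case AOE_CMD_ATA:
                return ETH_HDR_LEN + AOE_HDR_LEN + ATA_HDR_LEN;  /* 36 */
        case AOE_CMD_CONFIG:
                return ETH_HDR_LEN + AOE_HDR_LEN + CFG_HDR_LEN;  /* 32 */
        default:
                return ETH_HDR_LEN + AOE_HDR_LEN;                /* 24 */
        }
}

/* Hypothetical helper: 1 if n bytes is enough for the command in buf. */
int
aoe_len_ok(const unsigned char *buf, size_t n)
{
        assert(buf != NULL);
        if (n < ETH_HDR_LEN + AOE_HDR_LEN)
                return 0;
        /* The command byte is the sixth byte of the AoE header. */
        return n >= aoe_min_len(buf[ETH_HDR_LEN + 5]);
}
```

With these numbers the 36-byte ATA request in the capture passes, and the
smallest frame accepted for a config query is 32 bytes.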

-Tom
----------------------------------------------------------------------
tbrown@BareMetal.com   | Courage is doing what you're afraid to do.
http://BareMetal.com/  | There can be no courage unless you're scared.
                       | - Eddie Rickenbacker

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users

On Fri, 2007-03-09 at 00:37 -0800, Tom Brown wrote:
> I ran into an interesting/strange issue today. I still don't understand
> what happened, but I know what fixed it.

That was entirely useful to share and fixed a similar problem over here.

I'm off to go see if it cures some of my mysterious AoE ailments.

> It turns out that for some reason the packets were shorter
> than vblade expected, and it was ignoring them. I changed the
> check for packet length to be if < 32 instead of if < 60, and
> voila it works. (in aoe.c)
>
>         for (;;) {
>                 n = getpkt(sfd, buf, bufsz);
>                 if (n < 0) {
>                         perror("read network");
>                         exit(1);
>                 }
> //              if (n < 60) {
>                 if (n < 32) {
> //                      fprintf(stderr,"skipping short read (%d<36)\n",n);
>                         continue;
>                 }
>

Works very nicely on the mock-up, mixed-hardware, jury-rigged ocfs2
cluster I am working on, which wasn't working correctly before. I haven't
started hammering it yet because the drive backing it is a 6GB Travelstar
in an ancient PIII laptop.

My only concern is why the limit was set so high. Was it an oversight, or
a 'quick fix' for something yet to be discovered?

If it breaks I'll let you know :) Thanks again, that was really useful.

Best,
--Tim

On Fri, 2007-03-09 at 17:26 +0800, Tim Post wrote:
> On Fri, 2007-03-09 at 00:37 -0800, Tom Brown wrote:
> > I ran into an interesting/strange issue today. I still don't understand
> > what happened, but I know what fixed it.
>
> That was entirely useful to share and fixed a similar problem over here.
>
> I'm off to go see if it cures some of my mysterious AoE ailments.
>

On the other machines I have a newer version (vblade-12) downloaded from
Coraid, which I think fixes the issue. I'm not even sure what version I
have on the PIII, because there is no version stamp in it anywhere and the
source I installed from is gone.

Which version did you get?

The relevant code in my (newer) aoe.c:

        for (;;) {
                n = getpkt(sfd, buf, bufsz);
                if (n < 0) {
                        perror("read network");
                        exit(1);
                }
                if (n < 60)
                        continue;
                p = (Aoehdr *) buf;
                if (ntohs(p->type) != 0x88a2)
                        continue;
                if (p->flags & Resp)
                        continue;
                sh = ntohs(p->maj);
                if (sh != shelf && sh != (ushort)~0)
                        continue;
                if (p->min != slot && p->min != (uchar)~0)
                        continue;
                doaoe(p);
        }

Earlier, in aoe.c in function aoeata, I can see the default length is 60.

        aoeata(Ata *p)  // do ATA request
        {
                Ataregs r;
                int len = 60;

So it looks like better logic to deal with funky requests has been added.
But the two are now working together, where before they were not.

If it breaks I'll let you know; I'm skeptical. What could end up happening
is that the PIII will get and eat packets it has no idea what to do with :)
So be careful.

My current AoE modules and tools (vblade and kvblade) are here:

http://dev1.netkinetics.net/aoe/

if you want to look at my copy.

Best,
--Tim
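
For anyone who wants to poke at the filtering without a live socket, the
checks in the receive loop above can be pulled out into a standalone
function working on a raw byte buffer. This is only a sketch: aoe_wants,
be16 and the constants are stand-ins, not vblade's own Aoehdr/Resp
definitions; the response bit being 1<<3 in the ver/flags byte is an
assumption based on the AoE header layout; and the length floor here is
the bare 24-byte header rather than 60.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
        AOE_ETHERTYPE = 0x88a2,
        AOE_RESP      = 1 << 3,  /* assumed: response bit in ver/flags */
        AOE_MIN_HDR   = 24       /* 14-byte Ethernet + 10-byte AoE header */
};

/* Read a big-endian 16-bit field from the frame. */
static uint16_t
be16(const unsigned char *p)
{
        return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical stand-in for the loop's checks: return 1 if the frame is
 * an AoE request addressed to (shelf, slot) or to the all-ones broadcast
 * shelf/slot, and 0 otherwise.
 */
int
aoe_wants(const unsigned char *buf, size_t n, uint16_t shelf, uint8_t slot)
{
        uint16_t sh;
        uint8_t sl;

        assert(buf != NULL);
        if (n < AOE_MIN_HDR)
                return 0;
        if (be16(buf + 12) != AOE_ETHERTYPE)  /* ethertype field */
                return 0;
        if (buf[14] & AOE_RESP)               /* ignore responses */
                return 0;
        sh = be16(buf + 16);                  /* major = shelf */
        sl = buf[18];                         /* minor = slot */
        if (sh != shelf && sh != 0xffff)
                return 0;
        if (sl != slot && sl != 0xff)
                return 0;
        return 1;
}
```

Fed the 36-byte frame from Tom's tcpdump output, this decodes shelf 2,
slot 1, so a target exported with those numbers would accept it.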