Holger Schurig wrote:
>> With 3.3.5 my first test took 5 times to produce a non "bus
>> error" build. There were no 'make cleans' in between.
>>
>> What is going on?
>
> You mean you used your bsd-ports-provided gcc to compile LLVM and
> you've got 4 times a bus-error during the build? In this case,
> it cannot be a LLVM problem.

Ok, to clarify,

I have tried the OpenBSD provided gcc-3.3.5 (which is considered the
least buggy version of gcc) and also with gcc-4.2 from ports.

Sometimes you get a clean build of llvm, sometimes you don't and instead
get a bus error.

> In the linux-community, people say that bus-error's are almost
> always because of faulty hardware, e.g. problem with DRAM
> timing, overheated CPU, power-supply that cannot provide enough
> power during current surges, things like that.

That is one reason a bus error might occur, but my more common
understanding of a bus error is data not properly aligned with the byte
boundaries and/or out of range memory at the physical level.

The machine I am building on is my workstation which I use 9-4.30
mon-fri. I run all manner of apps without any problems, so if it were
bad hardware it would have shown itself by now surely.

As a test I got another developer to try on a different machine and he
has the same problem. In another test he also tried a more aggressive
malloc.conf (a mechanism which causes malloc to do all sorts of
randomisation and page filling to test for memory based bugs) and a
completely different error was encountered:

SelectionDAG.cpp:2602: warning: converting of negative value
`-1' to `long long

Also we found that without specifying --enable-optimized, the
optimisations were still present:

-O3 -fomit-frame-pointer -Woverloaded-virtual -pedantic
-Wall -W -Wwrite-strings -Wno-long-long -Wunused -Wno-unused-parameter
-O3

:¬(

--
Best Regards
Edd
http://students.dec.bmth.ac.uk/ebarrett
Hi,

On Wednesday 18 June 2008 15:08:46 Edd Barrett wrote:
> Holger Schurig wrote:
> >> With 3.3.5 my first test took 5 times to produce a non "bus
> >> error" build. There were no 'make cleans' in between.
> >>
> >> What is going on?
> >
> > You mean you used your bsd-ports-provided gcc to compile LLVM and
> > you've got 4 times a bus-error during the build? In this case,
> > it cannot be a LLVM problem.
>
> Ok, to clarify,
>
> I have tried the OpenBSD provided gcc-3.3.5 (which is considered the
> least buggy version of gcc) and also with gcc-4.2 from ports.
>
> Sometimes you get a clean build of llvm, sometimes you don't and instead
> get a bus error.

If I understand right, the problem is that you are unable to build LLVM
because your system gcc (and another gcc you tried) tends to crash during
the build?

> > In the linux-community, people say that bus-error's are almost
> > always because of faulty hardware, e.g. problem with DRAM
> > timing, overheated CPU, power-supply that cannot provide enough
> > power during current surges, things like that.
>
> That is one reason a bus error might occur, but my more common
> understanding of a bus error is data not properly aligned with the byte
> boundaries and/or out of range memory at the physical level.
>
> The machine I am building on is my workstation which I use 9-4.30
> mon-fri. I run all manner of apps without any problems, so if it were
> bad hardware it would have shown itself by now surely.

gcc is however notorious for exposing bad memory problems.

> As a test I got another developer to try on a different machine and he
> has the same problem. In another test he also tried a more aggressive
> malloc.conf (a mechanism which causes malloc to do all sorts of
> randomisation and page filling to test for memory based bugs) and a
> completely different error was encountered:
>
> SelectionDAG.cpp:2602: warning: converting of negative value
> `-1' to `long long

If I understand right, tweaking your system malloc caused the system
gcc to behave differently when compiling LLVM?

> Also we found that without specifying --enable-optimized, the
> optimisations were still present:
>
> -O3 -fomit-frame-pointer -Woverloaded-virtual -pedantic
> -Wall -W -Wwrite-strings -Wno-long-long -Wunused -Wno-unused-parameter
> -O3

--enable-optimized is not about whether or not compiler optimizations
are performed when building LLVM, it is about whether the built version
of LLVM performs internal checks when run.

Ciao,

Duncan.
On Wednesday, 18.06.2008, at 14:08 +0100, Edd Barrett wrote:
> Sometimes you get a clean build of llvm, sometimes you don't and instead
> get a bus error.

Nonreproducible behaviour in a batch application is usually a sign of
hardware problems.

> > In the linux-community, people say that bus-error's are almost
> > always because of faulty hardware, e.g. problem with DRAM
> > timing, overheated CPU, power-supply that cannot provide enough
> > power during current surges, things like that.
>
> That is one reason a bus error might occur, but my more common
> understanding of a bus error is data not properly aligned with the byte
> boundaries and/or out of range memory at the physical level.

Bus errors are usually the result of pointers getting corrupted. That
may be due to a bug, or due to hardware problems.

> The machine I am building on is my workstation which I use 9-4.30
> mon-fri. I run all manner of apps without any problems, so if it were
> bad hardware it would have shown itself by now surely.

Not really. gcc produces a different kind of load than most
applications.

> As a test I got another developer to try on a different machine and he
> has the same problem.

It is possible that both machines are faulty, though a second failing
machine reduces the probability of a hardware cause considerably.

> In another test he also tried a more aggressive
> malloc.conf (a mechanism which causes malloc to do all sorts of
> randomisation and page filling to test for memory based bugs) and a
> completely different error was encountered:
>
> SelectionDAG.cpp:2602: warning: converting of negative value
> `-1' to `long long

If you get irreproducible bus errors, that means random rare pointer
corruption, and pointer corruption can cause almost arbitrary fault
behaviour. So if you change the environment, pointer corruption will
change the fault behaviour, regardless of whether the corruption is due
to hardware or software.

> Also we found that without specifying --enable-optimized, the
> optimisations were still present:
>
> -O3 -fomit-frame-pointer -Woverloaded-virtual -pedantic
> -Wall -W -Wwrite-strings -Wno-long-long -Wunused -Wno-unused-parameter
> -O3
>
> :¬(

Can't comment on that one.

Try writing a script that populates an empty directory and does the
build. That way, you can guarantee identical environments (modulo
machine load and filesystem storage layout, but that should not
influence what batch programs like LLVM do).

If you get the same bus errors after deleting the directory and starting
over, it's probably a software problem. If the bus errors stay random,
that would increase the probability of a hardware problem.

Regards,
Jo
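Jo's suggestion of a script that populates an empty directory and then builds could be sketched roughly as follows. This is only an illustration under assumptions: `run_clean_build`, `SRC_DIR` and `BUILD_CMD` are hypothetical names, the source tree is assumed to be a pristine unpacked copy, and `mktemp -d` is assumed to be available on the build host.

```shell
#!/bin/sh
# Sketch only: run one build in a freshly populated scratch directory.
# run_clean_build, SRC_DIR and BUILD_CMD are hypothetical names, not part of LLVM.
# Assumes mktemp(1) with -d is available (not POSIX, but common on BSD and Linux).

run_clean_build() {
    # $1 = pristine source tree, $2 = build command line, $3 = log file
    # Returns the build's exit status, or 125 if the scratch copy could not be made.
    work=$(mktemp -d) || return 125
    cp -R "$1"/. "$work"/ || { rm -rf "$work"; return 125; }
    ( cd "$work" && sh -c "$2" ) >"$3" 2>&1
    status=$?
    rm -rf "$work"            # delete the directory so the next run starts over
    return "$status"
}

# Example driver: only runs when the caller supplies both variables.
if [ -n "${SRC_DIR:-}" ] && [ -n "${BUILD_CMD:-}" ]; then
    i=1
    while [ "$i" -le 5 ]; do
        run_clean_build "$SRC_DIR" "$BUILD_CMD" "build-$i.log"
        echo "run $i: exit status $?"
        i=$((i + 1))
    done
fi
```

Comparing the per-run logs then shows whether the failure recurs at the same place from an identical starting state or moves around between runs.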
On Thu, Jun 19, 2008 at 12:32 AM, Duncan Sands <baldrick at free.fr> wrote:
> Hi,
>
> On Wednesday 18 June 2008 15:08:46 Edd Barrett wrote:
>> Holger Schurig wrote:
>> >> With 3.3.5 my first test took 5 times to produce a non "bus
>> >> error" build. There were no 'make cleans' in between.
>> >>
>> >> What is going on?
>> >
>> > You mean you used your bsd-ports-provided gcc to compile LLVM and
>> > you've got 4 times a bus-error during the build? In this case,
>> > it cannot be a LLVM problem.
>>
>> Ok, to clarify,
>>
>> I have tried the OpenBSD provided gcc-3.3.5 (which is considered the
>> least buggy version of gcc) and also with gcc-4.2 from ports.
>>
>> Sometimes you get a clean build of llvm, sometimes you don't and instead
>> get a bus error.
>
> if I understand right the problem is that you are unable to build LLVM
> because your system gcc (and another gcc you tried) tends to crash during
> the build?

On several different systems.

>> > In the linux-community, people say that bus-error's are almost
>> > always because of faulty hardware, e.g. problem with DRAM
>> > timing, overheated CPU, power-supply that cannot provide enough
>> > power during current surges, things like that.
>>
>> That is one reason a bus error might occur, but my more common
>> understanding of a bus error is data not properly aligned with the byte
>> boundaries and/or out of range memory at the physical level.
>>
>> The machine I am building on is my workstation which I use 9-4.30
>> mon-fri. I run all manner of apps without any problems, so if it were
>> bad hardware it would have shown itself by now surely.
>
> gcc is however notorious for exposing bad memory problems.

The build also stops at exactly the same point in several different
*virtual* machines (the assert() in utils/TableGen/CodeGenDAGPatterns.cpp
line 932).

Please stop repeating the "bad memory" mantra, that hasn't been true for
years; it is much more likely to be a bug in gcc.

>> As a test I got another developer to try on a different machine and he
>> has the same problem. In another test he also tried a more aggressive
>> malloc.conf (a mechanism which causes malloc to do all sorts of
>> randomisation and page filling to test for memory based bugs) and a
>> completely different error was encountered:
>>
>> SelectionDAG.cpp:2602: warning: converting of negative value
>> `-1' to `long long
>
> If I understand right, tweaking your system malloc caused the system
> gcc to behave differently when compiling LLVM?

Sorry about that, I wasn't very clear when passing on some error
messages to Edd; I was just pointing out some sloppy coding. Changing
the malloc options had no effect on the build.

>> Also we found that without specifying --enable-optimized, the
>> optimisations were still present:
>>
>> -O3 -fomit-frame-pointer -Woverloaded-virtual -pedantic
>> -Wall -W -Wwrite-strings -Wno-long-long -Wunused -Wno-unused-parameter
>> -O3
>
> --enable-optimized is not about whether or not compiler optimizations
> are performed when building LLVM, it is about whether the built version
> of LLVM performs internal checks when run.

Are you sure? Makefile.rules specifically includes/excludes
$(OPTIMIZE_OPTION) based on whether ENABLE_OPTIMIZED == 1.

The reason they are always included (on OpenBSD anyway) is this:
ENABLE_OPTIMIZED is always 1 because there are some shell-script syntax
problems in the configure script which possibly don't show up on all
shells. It uses "${foo+set}" instead of "${foo:+set}" - see sh(1) on
almost any OS.

Edd: I'm fairly sure this is a bug in our gcc. After fixing the syntax
errors in the configure script and doing a build with optimizations
turned off, it stops at the same point. I've tested it several times in
vmware and virtualbox so far. I can try qemu if you like, but it isn't
hardware related.

Perhaps refactoring the function TreePatternNode::ApplyTypeConstraints()
into several smaller functions would help.

Regards,
Andrew Dalgleish
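For readers unfamiliar with the two expansions Andrew contrasts, here is a small sketch of how they behave. Both forms are defined by POSIX sh and differ only when the variable is set but empty. `foo` is the placeholder name from the message, and the helper names `plus` and `colon_plus` are made up for illustration. Whether this difference accounts for what was seen on OpenBSD is not established by the sketch.

```shell
#!/bin/sh
# Illustration only: helper names are hypothetical; foo is a placeholder variable.

plus()       { printf '%s' "${foo+set}"; }    # "set" if foo is set, even when empty
colon_plus() { printf '%s' "${foo:+set}"; }   # "set" only if foo is set AND non-empty

show() {
    # $1 = label describing the current state of foo
    printf '%-10s +set=[%s]  :+set=[%s]\n' "$1" "$(plus)" "$(colon_plus)"
}

unset foo;  show "unset"
foo="";     show "empty"
foo="yes";  show "non-empty"
```

Only the "empty" row differs between the two columns: `${foo+set}` yields `set` there, while `${foo:+set}` yields nothing.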
On Jun 18, 2008, at 6:08 AM, Edd Barrett wrote:
> Sometimes you get a clean build of llvm, sometimes you don't and
> instead
> get a bus error.

gcc makes an excellent system test. Try this,

while :; do make bootstrap && make clean; done

with the FSF top of tree gcc. Let it run for 2 weeks. If it ever built
once, it should never fail to build. If it does, I'd install a good
linux distribution on the same hardware and try again; if it still
fails, look to replace the hardware. If linux works, I'd look to replace
the OS. If it failed every time in the exact same spot in the exact same
way, try the last FSF release for gcc. If it fails deterministically,
that could be a gcc bug (or very bad hardware). If it fails
non-deterministically, you're most likely looking at bad hardware.

> That is one reason a bus error might occur, but my more common
> understanding of a bus error is data not properly aligned with the
> byte
> boundaries and/or out of range memory at the physical level.

Absent a bad version of gcc and a bad OS, the usual culprit is bad
hardware.

> The machine I am building on is my workstation which I use 9-4.30
> mon-fri. I run all manner of apps without any problems, so if it were
> bad hardware it would have shown itself by now surely.

No. I've seen machines that work flawlessly, pass all manner of memory
tests including 24+ hours of standalone memtest86, right up to the point
you ask them to bootstrap gcc, then they fail, 100% of the time.

Find someone with good hardware, same OS. See if the testcase that fails
for you fails for them. Try to use the same gcc binaries for the test.
If it passes for someone else, again, probably bad hardware.

If you live in California, I'd ask if you bought your memory at Fry's?
They test all their memory to ensure it is bad, unless you buy the
namebrand memory. :-(
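The deterministic versus non-deterministic distinction drawn above can be checked mechanically once each failed run's output has been saved to its own log. The helper below is a rough sketch: `same_failure_point` is a hypothetical name, and comparing only the last line of each log is a crude stand-in for "failed in the exact same spot", so real logs may need a more careful comparison.

```shell
#!/bin/sh
# Sketch only: same_failure_point is a hypothetical helper.
# It treats "same last log line" as a crude proxy for "failed in the same spot".

same_failure_point() {
    # $1.. = log files from failed builds; succeeds if all end with the same line
    first=$(tail -n 1 "$1")
    shift
    for log in "$@"; do
        [ "$(tail -n 1 "$log")" = "$first" ] || return 1
    done
    return 0
}
```

A call such as `same_failure_point fail-1.log fail-2.log fail-3.log && echo "same spot each time"` (file names are placeholders) would then point toward a deterministic failure.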