Michael_E_Brown at Dell.com
2004-May-11 14:36 UTC
[syslinux] Trouble with ISOLINUX and IDE bus resets.
Hpa,

Dell ships a CD called Dell OpenManage Server Assistant that, starting with version 8.0 released last November, is a Linux-based bootable CD. It uses ISOLINUX to load a linux kernel/initrd combo to start the system. From version 8.0 to 8.2 we use isolinux version 1.66. Starting with version 8.3 we have upgraded to version 2.08. First of all, I'd like to say thanks for your excellent bootloaders, of which I have used all three for various projects, they have helped immensely on this project.

We have started to see a problem with the 1.66 version of isolinux on Dell PowerEdge 6650 server systems. On the newer 2.08 version, the problem happens less often, but we still get a problem from time to time. The cause of the problem seems to be that the CDROM device goes offline or has some sort of problem. A device reset is issued, but then it looks like when isolinux tries to re-read the sector, the BIOS int 13h call destination for the data is invalid and it crashes the system. We believe that the root cause is some hardware problem. But, the interesting thing is that the newer isolinux has the problem less often, and other OS bootloaders (windows NT in this case) also see the Device Reset, but they retry the read calls and continue going just fine. The newer isolinux gives "isolinux: Disk error 01, drive 82. Boot failed: press a key to retry" usually when it fails. The 1.66 version completely crashes the system with really neat video corruption.

I have searched through the changelogs but did not see any likely matches for this problem.

I have posted ASCII version of IDE bus traces of the problem here: http://www.michaels-house.net/~mebrown/IDE_bus_traces.tgz (4.7MB). The raw data for this is from "Bus Doctor", and the raw bus doctor data files are available upon request, but you need a Windows system to run Bus Doctor on. There is an evaluation version available you can use if you want to view the raw data.

The BIOS team has stepped through the code and that is how we determined that, after the read retry, the destination buffer given is a bad address. Unfortunately I do not have any raw data from the BIOS guys at this point. If there is something specific that you need I can ask them to provide it.

Have you seen this kind of problem, or is there some other data that we can provide that would help provide a software workaround for this problem?
--
Michael

Below I have copied several relevant entries from our internal bug tracking system

=================Trouble log=================

RMSD Update 4/29/04
by Villanueva, Jorge (4/29/2004 4:58:31 PM)
From the traces I am seeing that after the Reset is issued the host (Linux DSA 8.2) does not issue Read commands for a very large portion of data when compared with a passing case. This could be the cause of the corruption. When compared to the NT based DSA the host (NT) re-reads some portion of the data right before the Device Reset and then continues reading the rest. The DSA 8.2 does not appear to be reading in all the data. I have attached the traces I have used for this analysis.

by Gedela, Neeraja (4/28/2004 3:32:40 PM)
Hooked up an American Arium emulator and with the help of BIOS engineer Chin-Lung Chao, we reproduced the error. By stepping through the code around the point of failure, we saw the DSA is executing data transfer through the BIOS INT 13h calls by giving BIOS the CD sector from which to read data and the memory location to which this data is to be written. Ching-Lung says this memory location is an incorrect memory location to be written to, causing memory corruption to happen.

Added Trace files
by Villanueva, Jorge (4/23/2004 4:49:49 PM)
Added Trace file of failing and passing cases. Both were done with same media and on the same config. Please use Bus Doctor to view them. You can get this software at http://www.datatransit.com/support/demosw.html

More observations and update...
by Gedela, Neeraja (4/23/2004 2:25:25 PM)
* Removed gasket from chassis and could not reproduce the failure on the 2M451 drive w/DSA 8.1 A01 (~15 boots). Did an additional 11 boots at a later time and the fail occurred twice.
* Could not reproduce on Boxster w/2M451 or with 0R397 and DSA 8.1 A01
* There is a misalignment of P1 connector (Larry Kosch from mechanical team verified the difference by measuring) on the interposer (Dell P/N: 401JX) due to the fact that there is a warping of the metallic strip (of drive carrier) to which the interposer is mounted via 2 holes causing the holes to be skewed wrt each other instead of being in line horizontally. There could be a variance in the connector contacts along with the mating J1 connector on Dell P/N 64EEC interposer card.
* Seen following error once:
  "isolinux: Disk error 01, drive 82
  Boot failed: press a key to retry"
* Jorge Villanueva from the RMSD team helped capture traces by hooking up an IDE analyzer for both passing and failing cases with DSA 8.2 and a passing case with DSA 7.5
* From the captured traces, there seems a difference in the way the NT code and Linux code handles errors. Error handling mechanism for NT code seems more robust than the linux version, because error happened in both cases but the NT code actually recovers from it.
* On a general note, BIOS actually set the transfer mode to DMA if the device is DMA capable, and the

Michael_E_Brown at Dell.com wrote:
> Hpa,
> Dell ships a CD called Dell OpenManage Server Assistant that,
> starting with version 8.0 released last November, is a Linux-based
> bootable CD. It uses ISOLINUX to load a linux kernel/initrd combo to
> start the system. From version 8.0 to 8.2 we use isolinux version 1.66.
> Starting with version 8.3 we have upgraded to version 2.08. First of
> all, I'd like to say thanks for your excellent bootloaders, of which I
> have used all three for various projects, they have helped immensely on
> this project.
>
> We have started to see a problem with the 1.66 version of
> isolinux on Dell PowerEdge 6650 server systems. On the newer 2.08
> version, the problem happens less often, but we still get a problem from
> time to time. The cause of the problem seems to be that the CDROM device
> goes offline or has some sort of problem. A device reset is issued, but
> then it looks like when isolinux tries to re-read the sector, the BIOS
> int 13h call destination for the data is invalid and it crashes the
> system. We believe that the root cause is some hardware problem. But,
> the interesting thing is that the newer isolinux has the problem less
> often, and other OS bootloaders (windows NT in this case) also see the
> Device Reset, but they retry the read calls and continue going just
> fine. The newer isolinux gives "isolinux: Disk error 01, drive 82. Boot
> failed: press a key to retry" usually when it fails. The 1.66 version
> completely crashes the system with really neat video corruption.
>
> I have searched through the changelogs but did not see any
> likely matches for this problem.
>

If you could narrow down the range of versions then I might have a prayer of finding source changes. I know there was at least one BIOS on which a chunk of low memory got just plain overwritten in certain circumstances; something like that would definitely explain the problem!

Other possibilities: INT 13h returns with either the wrong value in (E)SP, or with one of the segment registers corrupted. Another possibility is that the DAPA is corrupted.

> I have posted ASCII version of IDE bus traces of the problem
> here: http://www.michaels-house.net/~mebrown/IDE_bus_traces.tgz (4.7MB).
> The raw data for this is from "Bus Doctor", and the raw bus doctor data
> files are available upon request, but you need a Windows system to run
> Bus Doctor on. There is an evaluation version available you can use if
> you want to view the raw data.

Not really useful to me, I'm afraid.

> The BIOS team has stepped through the code and that is how we
> determined that, after the read retry, the destination buffer given is a
> bad address. Unfortunately I do not have any raw data from the BIOS guys
> at this point. If there is something specific that you need I can ask
> them to provide it.
>
> Have you seen this kind of problem, or is there some other data
> that we can provide that would help provide a software workaround for
> this problem?

What would be useful is a dump of all the INT 13h calls including segment registers and DAPA (the 16-byte buffer pointed to by DS:SI). The easiest way to do that is to hack in some code just before the int 13h call.

> * Seen following error once:
> "isolinux: Disk error 01, drive 82
> Boot failed: press a key to retry"
> * Jorge Villanueva from the RMSD team helped capture traces by hooking
> up an IDE analyzer for both passing and failing cases with DSA 8.2 and a
> passing case with DSA 7.5
> * From the captured traces, there seems a difference in the way the NT
> code and Linux code handles errors. Error handling mechanism for NT code
> seems more robust than the linux version, because error happened in both
> cases but the NT code actually recovers from it.

This could just be pure dumb luck. I've looked at the code in ISOLINUX, and I'm pretty sure it is correct as written.

Error 01 means "invalid call"; it really could mean anything.

	-hpa
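
For anyone collecting the requested dump, the following is a minimal sketch of how a captured 16-byte DAPA might be decoded and sanity-checked offline. It assumes the standard INT 13h extensions packet layout (size, reserved byte, block count, buffer offset, buffer segment, 64-bit starting LBA) and 2048-byte CD sectors. The names `parse_dapa`, `linear_address` and `check_dapa` are hypothetical helpers, and the checks are illustrative only; they are not taken from ISOLINUX or from any BIOS.

```python
import struct
from typing import List, NamedTuple

CD_SECTOR_SIZE = 2048  # assumption: CD-ROM sectors as read by a no-emulation boot


class Dapa(NamedTuple):
    """Fields of a 16-byte INT 13h extensions disk address packet."""
    size: int      # packet size in bytes, expected to be 0x10
    reserved: int  # expected to be 0
    count: int     # number of blocks to transfer
    offset: int    # offset part of the destination buffer
    segment: int   # segment part of the destination buffer
    lba: int       # starting logical block address


def parse_dapa(raw: bytes) -> Dapa:
    """Unpack a little-endian 16-byte packet as dumped from DS:SI."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes, got %d" % len(raw))
    return Dapa(*struct.unpack("<BBHHHQ", raw))


def linear_address(d: Dapa) -> int:
    """Real-mode linear address of the destination buffer."""
    return (d.segment << 4) + d.offset


def check_dapa(d: Dapa, sector_size: int = CD_SECTOR_SIZE) -> List[str]:
    """Return human-readable problems found in the packet (illustrative checks)."""
    problems = []
    if d.size != 0x10:
        problems.append("size byte is 0x%02x, expected 0x10" % d.size)
    if d.reserved != 0:
        problems.append("reserved byte is non-zero")
    if d.count == 0:
        problems.append("block count is zero")
    if d.offset + d.count * sector_size > 0x10000:
        problems.append("transfer runs past the end of the 64 KiB segment")
    if linear_address(d) < 0x500:
        problems.append("buffer overlaps the interrupt vector table or BIOS data area")
    return problems


if __name__ == "__main__":
    # Made-up example packet: 4 sectors from LBA 27 into 1000:0000.
    sample = struct.pack("<BBHHHQ", 0x10, 0, 4, 0x0000, 0x1000, 27)
    packet = parse_dapa(sample)
    print(packet, hex(linear_address(packet)), check_dapa(packet))
```

Running the same decoder over the packet seen before each call and again on the retry after the device reset would show whether the segment:offset pair changes between the two.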