thr3ads.net - Syslinux - [syslinux] Trouble with ISOLINUX and IDE bus resets. [May 2004]


Michael_E_Brown at Dell.com

2004-May-11 14:36 UTC

[syslinux] Trouble with ISOLINUX and IDE bus resets.

Hpa,
Dell ships a CD called Dell OpenManage Server Assistant that,
starting with version 8.0 released last November, is a Linux-based
bootable CD. It uses ISOLINUX to load a linux kernel/initrd combo to
start the system. From version 8.0 to 8.2 we use isolinux version 1.66.
Starting with version 8.3 we have upgraded to version 2.08. First of
all, I'd like to say thanks for your excellent bootloaders, of which I
have used all three for various projects, they have helped immensely on
this project.

We have started to see a problem with the 1.66 version of
isolinux on Dell PowerEdge 6650 server systems. On the newer 2.08
version, the problem happens less often, but we still get a problem from
time to time. The cause of the problem seems to be that the CDROM device
goes offline or has some sort of problem. A device reset is issued, but
then it looks like when isolinux tries to re-read the sector, the BIOS
int 13h call destination for the data is invalid and it crashes the
system. We believe that the root cause is some hardware problem. But,
the interesting thing is that the newer isolinux has the problem less
often, and other OS bootloaders (windows NT in this case) also see the
Device Reset, but they retry the read calls and continue going just
fine. The newer isolinux gives "isolinux: Disk error 01, drive 82. Boot
failed: press a key to retry" usually when it fails. The 1.66 version
completely crashes the system with really neat video corruption.
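To make the "retry the read calls and continue" behaviour concrete, here is a minimal sketch in C of a read loop that resets the device after a failure and then reissues the read with the original sector and destination buffer. It is not ISOLINUX or NT loader code; `read_with_retry`, the callback types, and the simulated `fake_drive` are hypothetical names invented for illustration, with the callbacks standing in for the BIOS INT 13h read and reset calls.

```c
#include <stdint.h>
#include <string.h>

#define CD_SECTOR_SIZE 2048u

/* Hypothetical callbacks standing in for the BIOS read and reset calls. */
typedef uint8_t (*read_fn)(void *ctx, uint32_t lba, uint8_t *buf);
typedef void (*reset_fn)(void *ctx);

/*
 * Read one sector, retrying up to max_tries times. Every attempt is given
 * the caller's original lba and buf again, so a failed attempt cannot leave
 * a stale destination behind. Returns 0 on success, else the last status.
 */
static uint8_t read_with_retry(read_fn rd, reset_fn rst, void *ctx,
                               uint32_t lba, uint8_t *buf, int max_tries)
{
    uint8_t status = 0xFF; /* returned unchanged if max_tries <= 0 */

    for (int attempt = 0; attempt < max_tries; attempt++) {
        status = rd(ctx, lba, buf);
        if (status == 0)
            return 0;
        rst(ctx); /* reset the device before the next attempt */
    }
    return status;
}

/* Simulated drive: fails the first failures_left reads with status 01h. */
struct fake_drive {
    int failures_left;
    int resets;
};

static uint8_t fake_read(void *ctx, uint32_t lba, uint8_t *buf)
{
    struct fake_drive *d = ctx;

    if (d->failures_left > 0) {
        d->failures_left--;
        return 0x01;
    }
    memset(buf, (int)(lba & 0xFF), CD_SECTOR_SIZE);
    return 0x00;
}

static void fake_reset(void *ctx)
{
    struct fake_drive *d = ctx;
    d->resets++;
}
```

With a `fake_drive` set to fail twice, `read_with_retry(fake_read, fake_reset, &drive, 16, buf, 5)` succeeds on the third attempt after two resets. The point of the sketch is only that the request parameters are rebuilt from the caller's values on every attempt.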

I have searched through the changelogs but did not see any
likely matches for this problem.

I have posted ASCII version of IDE bus traces of the problem
here: http://www.michaels-house.net/~mebrown/IDE_bus_traces.tgz (4.7MB).
The raw data for this is from "Bus Doctor", and the raw Bus Doctor data
files are available upon request, but you need a Windows system to run
Bus Doctor on. There is an evaluation version available you can use if
you want to view the raw data.

The BIOS team has stepped through the code and that is how we
determined that, after the read retry, the destination buffer given is a
bad address. Unfortunately I do not have any raw data from the BIOS guys
at this point. If there is something specific that you need I can ask
them to provide it.

Have you seen this kind of problem, or is there some other data
that we can provide that would help provide a software workaround for
this problem?
--
Michael

Below I have copied several relevant entries from our internal bug
tracking system
=================Trouble log=================
RMSD Update 4/29/04 by Villanueva, Jorge (4/29/2004 4:58:31 PM)
From the traces I am seeing that after the Reset is issued the host
(Linux DSA 8.2) does not issue Read commands for a very large portion of
the data when compared with a passing case. This could be the cause of
the corruption. By comparison, with the NT-based DSA the host (NT)
re-reads some portion of the data from right before the Device Reset and
then continues reading the rest. The DSA 8.2 does not appear to be
reading in all the data. I have attached the traces I have used for this
analysis.

by Gedela, Neeraja (4/28/2004 3:32:40 PM)
Hooked up an American Arium emulator and, with the help of BIOS engineer
Chin-Lung Chao, we reproduced the error. By stepping through the code
around the point of failure, we saw that the DSA performs data transfers
through BIOS INT 13h calls, giving the BIOS the CD sector to read from
and the memory location to write the data to. Ching-Lung says this
memory location is not a valid location to write to, which causes the
memory corruption.

Added Trace files by Villanueva, Jorge (4/23/2004 4:49:49 PM)
Added trace files of failing and passing cases. Both were done with the
same media and on the same config. Please use Bus Doctor to view them; you
can get this software at http://www.datatransit.com/support/demosw.html

More observations and update... by Gedela, Neeraja (4/23/2004 2:25:25 PM)
* Removed gasket from chassis and could not reproduce the failure on the
2M451 drive w/DSA 8.1 A01 (~15 boots). Did an additional 11 boots at a
later time and the failure occurred twice.
* Could not reproduce on Boxster w/2M451 or with 0R397 and DSA 8.1 A01
* There is a misalignment of P1 connector (Larry Kosch from mechanical
team verified the difference by measuring) on the interposer (Dell P/N:
401JX) due to the fact that there is a warping of the metallic strip (of
drive carrier) to which the interposer is mounted via 2 holes causing
the holes to be skewed wrt each other instead of being in line
horizontally. There could be a variance in the connector contacts along
with the mating J1 connector on Dell P/N 64EEC interposer card.

* Seen following error once:
"isolinux: Disk error 01, drive 82
Boot failed: press a key to retry"
* Jorge Villanueva from the RMSD team helped capture traces by hooking
up an IDE analyzer for both passing and failing cases with DSA 8.2 and a
passing case with DSA 7.5
* From the captured traces, there seems to be a difference in the way the
NT code and the Linux code handle errors. The error handling mechanism in
the NT code seems more robust than in the Linux version, because the error
happened in both cases but the NT code actually recovers from it.
* On a general note, BIOS actually set the transfer mode to DMA if the
device is DMA capable, and the

H. Peter Anvin

2004-May-12 02:03 UTC


[syslinux] Trouble with ISOLINUX and IDE bus resets.

Michael_E_Brown at Dell.com wrote:
> Hpa,
> 	Dell ships a CD called Dell OpenManage Server Assistant that,
> starting with version 8.0 released last November, is a Linux-based
> bootable CD. It uses ISOLINUX to load a linux kernel/initrd combo to
> start the system. From version 8.0 to 8.2 we use isolinux version 1.66.
> Starting with version 8.3 we have upgraded to version 2.08. First of
> all, I'd like to say thanks for your excellent bootloaders, of which I
> have used all three for various projects, they have helped immensely on
> this project.
> 
> 	We have started to see a problem with the 1.66 version of
> isolinux on Dell PowerEdge 6650 server systems. On the newer 2.08
> version, the problem happens less often, but we still get a problem from
> time to time. The cause of the problem seems to be that the CDROM device
> goes offline or has some sort of problem. A device reset is issued, but
> then it looks like when isolinux tries to re-read the sector, the BIOS
> int 13h call destination for the data is invalid and it crashes the
> system. We believe that the root cause is some hardware problem. But,
> the interesting thing is that the newer isolinux has the problem less
> often, and other OS bootloaders (windows NT in this case) also see the
> Device Reset, but they retry the read calls and continue going just
> fine. The newer isolinux gives "isolinux: Disk error 01, drive 82. Boot
> failed: press a key to retry" usually when it fails. The 1.66 version
> completely crashes the system with really neat video corruption.
> 
> 	I have searched through the changelogs but did not see any
> likely matches for this problem.
> 
If you could narrow down the range of versions then I might have a 
prayer of finding source changes.  I know there was at least one BIOS on 
which a chunk of low memory got just plain overwritten in certain 
circumstances; something like that would definitely explain the problem!

Other possibilities: INT 13h returns with either the wrong value in 
(E)SP, or with one of the segment registers corrupted.  Another 
possibility is that the DAPA is corrupted.

> 	I have posted ASCII version of IDE bus traces of the problem
> here: http://www.michaels-house.net/~mebrown/IDE_bus_traces.tgz (4.7MB).
> The raw data for this is from "Bus Doctor", and the raw bus doctor data
> files are available upon request, but you need a Windows system to run
> Bus Doctor on. There is an evaluation version available you can use if
> you want to view the raw data.

Not really useful to me, I'm afraid.

> 	The BIOS team has stepped through the code and that is how we
> determined that, after the read retry, the destination buffer given is a
> bad address. Unfortunately I do not have any raw data from the BIOS guys
> at this point. If there is something specific that you need I can ask
> them to provide it.
> 
> 	Have you seen this kind of problem, or is there some other data
> that we can provide that would help provide a software workaround for
> this problem?
What would be useful is a dump of all the INT 13h calls including 
segment registers and DAPA (the 16-byte buffer pointed to by DS:SI). 
The easiest way to do that is to hack in some code just before the int 
13h call.
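As an illustration of what such a dump could be checked against, here is a minimal sketch in C that decodes the 16-byte packet offline and flags suspicious destination buffers. The field layout is the standard 16-byte disk address packet used by the INT 13h extensions; the names `dapa_decode`, `dapa_linear`, `dapa_looks_sane`, and `dapa_format`, and the sanity rules themselves (no 64 KB segment wrap, transfer ends below A0000h), are assumptions made for this example, not anything taken from ISOLINUX.

```c
#include <stdint.h>
#include <stdio.h>

/* Decoded form of the 16-byte packet pointed to by DS:SI. */
struct dapa {
    uint8_t  size;     /* packet size in bytes, 10h for this layout */
    uint8_t  reserved;
    uint16_t count;    /* number of blocks to transfer */
    uint16_t buf_off;  /* destination buffer offset */
    uint16_t buf_seg;  /* destination buffer segment */
    uint64_t lba;      /* starting block number */
};

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode a raw little-endian packet as captured in a dump. */
static struct dapa dapa_decode(const uint8_t raw[16])
{
    struct dapa d;

    d.size = raw[0];
    d.reserved = raw[1];
    d.count = le16(raw + 2);
    d.buf_off = le16(raw + 4);
    d.buf_seg = le16(raw + 6);
    d.lba = 0;
    for (int i = 7; i >= 0; i--)
        d.lba = (d.lba << 8) | raw[8 + i];
    return d;
}

/* Real-mode linear address of the destination buffer. */
static uint32_t dapa_linear(const struct dapa *d)
{
    return ((uint32_t)d->buf_seg << 4) + d->buf_off;
}

/* Heuristic checks only (assumed rules); returns 1 if nothing looks wrong. */
static int dapa_looks_sane(const struct dapa *d, uint32_t sector_size)
{
    uint32_t len = (uint32_t)d->count * sector_size;

    if (d->size != 0x10 || d->count == 0)
        return 0;
    if ((uint32_t)d->buf_off + len > 0x10000u)
        return 0; /* transfer would wrap past the end of the segment */
    if (dapa_linear(d) + len > 0xA0000u)
        return 0; /* transfer would run into the video memory area */
    return 1;
}

/* Format one log line; returns the snprintf result. */
static int dapa_format(const struct dapa *d, char *out, size_t n)
{
    return snprintf(out, n,
                    "DAPA size=%02X count=%u buf=%04X:%04X linear=%05lX lba=%llu",
                    (unsigned)d->size, (unsigned)d->count,
                    (unsigned)d->buf_seg, (unsigned)d->buf_off,
                    (unsigned long)dapa_linear(d),
                    (unsigned long long)d->lba);
}
```

Feeding each captured packet through `dapa_decode` and `dapa_looks_sane(&d, 2048)` would show whether the destination buffer changes between the first read and the retry. The capture itself still has to be hacked in just before the int 13h call, as described above.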
> * Seen following error once:
> "isolinux: Disk error 01, drive 82 
> Boot failed: press a key to retry"
> * Jorge Villanueva from the RMSD team helped capture traces by hooking
> up an IDE analyzer for both passing and failing cases with DSA 8.2 and a
> passing case with DSA 7.5
> * From the captured traces, there seems to be a difference in the way the
> NT code and the Linux code handle errors. The error handling mechanism in
> the NT code seems more robust than in the Linux version, because the error
> happened in both cases but the NT code actually recovers from it.
This could just be pure dumb luck.  I've looked at the code in ISOLINUX, 
and I'm pretty sure it is correct as written.

Error 01 means "invalid call", it really could mean anything.
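For reference when reading the "Disk error 01, drive 82" message, here is a small sketch in C that maps a few INT 13h status bytes to their conventional meanings. The meanings follow commonly published interrupt references, the function name `int13_status_text` is hypothetical, and a given BIOS may return any of these codes for unrelated reasons, so the text is a hint and not a diagnosis.

```c
#include <stdint.h>

/* Conventional meaning of an INT 13h status byte (returned in AH). */
static const char *int13_status_text(uint8_t status)
{
    switch (status) {
    case 0x00: return "successful completion";
    case 0x01: return "invalid function in AH or invalid parameter";
    case 0x02: return "address mark not found";
    case 0x03: return "disk write-protected";
    case 0x04: return "sector not found or read error";
    case 0x80: return "timeout (not ready)";
    default:   return "other or unknown status";
    }
}
```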

	-hpa
