Hello,

I have a remote machine with a probably defective ODD, but up to r294989 (from Jan 28th) I could boot with just these warnings:

(cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00
(cd1:ahcich1:0:0:0): CAM status: SCSI Status Error
(cd1:ahcich1:0:0:0): SCSI status: Check Condition
(cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(cd1:ahcich1:0:0:0): Error 5, Unretryable error
(cd1:ahcich1:0:0:0): cddone: got error 0x5 back
...

beta3 doesn't boot anymore; it hangs with AHCI timeouts:

ahcich2: Timeout on slot 11 port 0
ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 derr 00000000 cmd 0004cb17
(ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d 01 00 00 00 00
...
(aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich2:0:0:0) CAM status: Command timeout
(aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked
ada1 detached
...

The numbers (in the first ACB) and also the channel vary from time to time.

I couldn't track down the offending revision yet; so far I have checked r295124 and r295131. I just noticed that probing differs between the working revision (r294989) and the non-working one (r296074): the non-working one attaches cd after ada, while the working one probes cd first.

I'll try to find out more by next weekend. Any hints welcome.

Thanks,

-Harry
Regarding Harry Schmalzbauer's message of 28.02.2016 20:55 (localtime):
> Hello,
>
> I have a remote machine with a probably defective ODD, but until r294989
> (from Jan 28th) I could boot with just these warnings:
> [...]
> The numbers (first ACB) and also the channel varies from time to time

I could narrow it down to r295480 (https://svnweb.freebsd.org/base?view=revision&revision=295480).

Reverting that lets the machine boot again.

I captured verbose boot messages and found that the problem is mitigated when booting verbose, since ahci seems to recover:

...
TSC timecounter discards lower 1 bit(s)
Timecounter "TSC-low" frequency 1746033500 Hz quality -100
ahcich2: Timeout on slot 12 port 0
ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr 00000000 cmd 0004cc17
ahcich2: AHCI reset...
(ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01 00 00 00 00
(ada1:ahcich2:0:0:0): CAM status: Command timeout
(ada1:ahcich2:0:0:0): Retrying command
ahcich2: SATA connect time=100us status=00000123
ahcich2: AHCI reset: device found
ahcich2: AHCI reset: device ready after 100ms
ahcich1: SNTF 0x0001
ahcich1: SNTF 0x0001
...

I have checked twice that r295480 introduces the boot failure here.

I have absolutely no idea what race happens here, or where, how, or why...

Thanks for any hints,

-Harry
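Narrowing the failure down to r295480 presumably amounts to a bisection between a known-good revision (r294989) and a known-bad one (r296074). The sketch below shows that search, assuming a hypothetical `boots(rev)` predicate that stands in for "build, install, and reboot the remote machine at `rev`", and assuming the failure is monotonic (once introduced, every later revision fails). It is an illustration only, not how the revision was actually found.

```python
from typing import Callable


def bisect_revisions(good: int, bad: int, boots: Callable[[int], bool]) -> int:
    """Return the first failing revision between a known-good and known-bad one.

    `boots` is a hypothetical predicate: True if the machine boots at that
    revision. Invariant kept by the loop: boots(good) is True and boots(bad)
    is False, so when the two are adjacent, `bad` is the first failing revision.
    """
    if good >= bad:
        raise ValueError("the good revision must be older than the bad one")
    while bad - good > 1:
        mid = (good + bad) // 2
        if boots(mid):
            good = mid  # failure was introduced after mid
        else:
            bad = mid   # failure was introduced at or before mid
    return bad
```

With r294989 as good and r296074 as bad, the interval is 1085 revisions wide, so at most 11 test boots are needed. Two caveats: Subversion revision numbers are repository-wide, so many midpoints may not touch the branch under test, and since the hang here is intermittent (and less likely with verbose boot), a single successful boot is weak evidence; `boots` should repeat the test a few times, as Harry did by checking twice.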
On Monday, February 29, 2016 07:29:03 PM Harry Schmalzbauer wrote:
> Regarding Harry Schmalzbauer's message of 28.02.2016 20:55 (localtime):
> [...]
> I could narrow it down to r295480
> (https://svnweb.freebsd.org/base?view=revision&revision=295480)
>
> Reverting that lets the machine boot again.
> [...]
> I have checked twice that r295480 introduces boot failure here.
>
> I have absolutely no idea where/how/why/what race happens...
>
> Thanks for any hints,

That is most bizarre. Does HEAD boot fine on this machine? The change in question probably alters the timing of startup a bit, since the random kthread is placed on the run queue later, which might affect the relative order of kthreads as they start executing. But that would just mean it is exposing a race in some other part of the system.

-- 
John Baldwin