All,

Release testing has shown several recent problems with the 3ware (twe)
driver.  Attached is a patch that appears to fix these problems.  I
would appreciate as much testing as possible before I commit so that I
can be sure that all of the problems are caught and fixed correctly.
The patch applies to both RELENG_4_10 and RELENG_4 branches.  Symptoms
of the problems included i/o hangs under heavy load and filesystem
corruption.

Thanks,

Scott
-------------- next part --------------
Index: twe.c
==================================================================
RCS file: /mnt/ncvs/src/sys/dev/twe/twe.c,v
retrieving revision 1.1.2.8
diff -u -r1.1.2.8 twe.c
--- twe.c	7 Apr 2004 22:18:00 -0000	1.1.2.8
+++ twe.c	3 May 2004 01:03:37 -0000
@@ -393,14 +393,14 @@
 
 	/* build a command from an outstanding bio */
 	if (tr == NULL) {
-
-	    /* see if there's work to be done */
-	    if ((bp = twe_dequeue_bio(sc)) == NULL)
-		break;
 
 	    /* get a command to handle the bio with */
-	    if (twe_get_request(sc, &tr)) {
-		twe_enqueue_bio(sc, bp);	/* failed, put the bio back */
+	    if (twe_get_request(sc, &tr))
+		break;
+
+	    /* see if there's work to be done */
+	    if ((bp = twe_dequeue_bio(sc)) == NULL) {
+		twe_release_request(tr);
 		break;
 	    }
 
@@ -1093,20 +1093,18 @@
 {
     TWE_Response_Queue	rq;
     struct twe_request	*tr;
-    int			s, found;
+    int			s;
     u_int32_t		status_reg;
 
     debug_called(5);
 
     /* loop collecting completed commands */
-    found = 0;
     s = splbio();
     for (;;) {
 	status_reg = TWE_STATUS(sc);
 	twe_check_bits(sc, status_reg);		/* XXX should this fail? */
 
 	if (!(status_reg & TWE_STATUS_RESPONSE_QUEUE_EMPTY)) {
-	    found = 1;
 	    rq = TWE_RESPONSE_QUEUE(sc);
 	    tr = sc->twe_lookup[rq.u.response_id];	/* find command */
 	    if (tr->tr_status != TWE_CMD_BUSY)
@@ -1117,6 +1115,7 @@
 	    /* move to completed queue */
 	    twe_remove_busy(tr);
 	    twe_enqueue_complete(tr);
+	    sc->twe_state &= ~TWE_STATE_FRZN;
 	} else {
 	    break;					/* no response ready */
 	}
@@ -1124,8 +1123,7 @@
     splx(s);
 
     /* if we've completed any commands, try posting some more */
-    if (found)
-	twe_startio(sc);
+    twe_startio(sc);
 
     /* handle completion and timeouts */
     twe_complete(sc);		/* XXX use deferred completion? */
Index: twe_freebsd.c
==================================================================
RCS file: /mnt/ncvs/src/sys/dev/twe/twe_freebsd.c,v
retrieving revision 1.2.2.8
diff -u -r1.2.2.8 twe_freebsd.c
--- twe_freebsd.c	7 Apr 2004 22:18:00 -0000	1.2.2.8
+++ twe_freebsd.c	3 May 2004 01:00:32 -0000
@@ -944,8 +944,6 @@
 
     tr->tr_flags |= TWE_CMD_MAPPED;
 
-    if (tr->tr_flags & TWE_CMD_IN_PROGRESS)
-	tr->tr_sc->twe_state &= ~TWE_STATE_FRZN;
     /* save base of first segment in command (applicable if there only one segment) */
     tr->tr_dataphys = segs[0].ds_addr;
 
@@ -1055,7 +1053,6 @@
 	if ((error = bus_dmamap_load(sc->twe_buffer_dmat, tr->tr_dmamap,
 				     tr->tr_data, tr->tr_length,
 				     twe_setup_data_dmamap, tr, 0) == EINPROGRESS)) {
-	    tr->tr_flags |= TWE_CMD_IN_PROGRESS;
 	    sc->twe_state |= TWE_STATE_FRZN;
 	    error = 0;
 	}
@@ -1102,6 +1099,8 @@
 	free(tr->tr_data, TWE_MALLOC_CLASS);
 	tr->tr_data = tr->tr_realdata;		/* restore 'real' data pointer */
     }
+
+    tr->tr_flags &= ~TWE_CMD_MAPPED;
 }
 
 #ifdef TWE_DEBUG
Index: twevar.h
==================================================================
RCS file: /mnt/ncvs/src/sys/dev/twe/twevar.h,v
retrieving revision 1.1.2.6
diff -u -r1.1.2.6 twevar.h
--- twevar.h	7 Apr 2004 22:18:01 -0000	1.1.2.6
+++ twevar.h	3 May 2004 00:49:34 -0000
@@ -121,7 +121,6 @@
 #define TWE_CMD_ALIGNBUF	(1<<2)	/* data in bio is misaligned, have to copy to/from private buffer */
 #define TWE_CMD_SLEEPER		(1<<3)	/* owner is sleeping on this command */
 #define TWE_CMD_MAPPED		(1<<4)	/* cmd has been mapped */
-#define TWE_CMD_IN_PROGRESS	(1<<5)	/* bus_dmamap_load returned EINPROGRESS */
     void			(* tr_complete)(struct twe_request *tr);	/* completion handler */
     void			*tr_private;	/* submitter-private data or wait channel */

What cards has this shown up with, and what versions of the BIOS?  I
have quite a few 3ware boxes deployed and have not seen any problems.
When was the bug introduced?

        ---Mike

At 10:09 AM 03/05/2004, Scott Long wrote:
>All,
>
>Release testing has shown several recent problems with the 3ware (twe)
>driver.  Attached is a patch that appears to fix these problems.  I
On 03 May 2004, Scott Long wrote:
> Release testing has shown several recent problems with the 3ware (twe)
> driver.  Attached is a patch that appears to fix these problems.  I
> would appreciate as much testing as possible before I commit so that I
> can be sure that all of the problems are caught and fixed correctly.
> The patch applies to both RELENG_4_10 and RELENG_4 branches.  Symptoms
> of the problems included i/o hangs under heavy load and filesystem
> corruption.

        Out of curiosity, do you think this might be happening in
CURRENT as of at least 5.2.1-RELEASE-p5 too?  The reason I ask is that
I've seen some hard system freezes (not even crashing, just locking up
hard) under 5 with a 2TB twe array.  I can reproduce it almost without
fail by hitting my Debian archive on that disk array from three Debian
Linux clients simultaneously doing updates through dselect.  With
Apache grabbing at the same files for those three connections, my 5
server just stops dead.  It doesn't happen every single time I do
this, but a good percentage of the time (I'd say at least half), it
will trigger whatever bug I'm seeing.

        Just to be clear, this box is an SMP box.  I'm still running
the older BSD scheduler instead of ULE.  It has an em network
interface running at 100Mbps/full.  The file system on the RAID array
is UFS2.

        I recently added all the debugging options back into the
kernel to see if I could get a good crash dump, but I've been
unwilling to trigger the bug again since the server is
pseudo-production (I know, I know...) at this point and fsck'ing that
much drive space is SLOW (I've had bad luck with the whole background
fsck'ing idea; it tends to just lock the machine up again).

        Anyway, ignore all of this if you think that this problem
shouldn't exist in CURRENT.

-- 
Mark Nipper
Computing and Information Services, Texas A&M University
nipsy@tamu.edu
http://ops.tamu.edu/nipsy/
At 10:09 AM 03/05/2004, Scott Long wrote:
>All,
>
>Release testing has shown several recent problems with the 3ware (twe)
>driver.  Attached is a patch that appears to fix these problems.  I
>would appreciate as much testing as possible before I commit so that I
>can be sure that all of the problems are caught and fixed correctly.
>The patch applies to both RELENG_4_10 and RELENG_4 branches.  Symptoms
>of the problems included i/o hangs under heavy load and filesystem
>corruption.

I was never able to recreate the problem; however, I did test the
patches on a couple of machines.  Note that in the past, a LONG time
ago (back when msmith@freebsd.org wrote the drivers), the following
with just 10 processes could hang the twe system, with all disk I/O
blocked in some race condition.

#!/bin/sh
i=1
while [ $i -le 50 ]
do
        i=`expr $i + 1`
        bonnie -s 100 -d /usr/ &
done

It now seems to get through it with no problem, and disk I/O is
consistent, despite 50 processes blocking.

I also did

make buildworld
make -j2 buildworld
make -j3 buildworld
...
all the way to -j11
make -j11 buildworld > /var/log/build.out

all with seemingly no problems.

I tested on a

Monitor version:  ME7X 1.01.00.038
Firmware version: FE7S 1.05.00.063
BIOS version:     BE7X 1.08.00.048
PCB version:      Rev5
Achip version:    3.20
Pchip version:    1.30-66
Model:            8006-2LP
Unit count:       1

and a

Monitor version:  ME6X 1.01.00.028
Firmware version: FE6X 1.02.28.053
BIOS version:     BE6X 1.07.02.005
PCB version:      Rev2
Achip version:    V4.40
Pchip version:    V5.70
Model:            6400
Unit count:       1

both running 4.10-PRERELEASE.

Also, would not these same issues crop up in -HEAD?

        ---Mike
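The buildworld sweep described above (a plain make buildworld, then -j2
through -j11) could be scripted roughly as follows.  This is only a
minimal sketch, not what was actually run: the function name
sweep_buildworld and the DRY_RUN, SRC and LOG variables are
hypothetical, and /usr/src is assumed as the source tree location.  By
default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: run (or just print) buildworld at increasing -j levels.
#   $1       highest -j level to try (default 11, as in the message above)
#   DRY_RUN  1 (default) prints the commands; 0 actually runs them
#   SRC      source tree (assumed /usr/src)
#   LOG      build log (assumed /var/log/build.out)
sweep_buildworld() {
    max=${1:-11}
    j=1
    while [ "$j" -le "$max" ]; do
        # The first pass is a plain, non-parallel build.
        if [ "$j" -eq 1 ]; then
            cmd="make buildworld"
        else
            cmd="make -j$j buildworld"
        fi
        if [ "${DRY_RUN:-1}" -eq 1 ]; then
            echo "$cmd"
        else
            # Stop the sweep at the first failing build.
            (cd "${SRC:-/usr/src}" && $cmd) >> "${LOG:-/var/log/build.out}" 2>&1 || return 1
        fi
        j=$((j + 1))
    done
}
```

Calling sweep_buildworld with no arguments lists the eleven commands;
setting DRY_RUN=0 first runs them back to back, appending all output
to the log so a hang or failure can be tied to a specific -j level.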
On Monday, 3 May 2004 16:09, Scott Long wrote:
> All,
>
> Release testing has shown several recent problems with the 3ware (twe)
> driver.  Attached is a patch that appears to fix these problems.  I
> would appreciate as much testing as possible before I commit so that I
> can be sure that all of the problems are caught and fixed correctly.
> The patch applies to both RELENG_4_10 and RELENG_4 branches.  Symptoms
> of the problems included i/o hangs under heavy load and filesystem
> corruption.

Is it possible/advisable for me to try this patch against -current?

I have symptoms where the controller LED is on and the system shows a
constant transfer to (or from) the disk at about 4MB/s with a disk
utilization of 99%, but not a single bit goes over the wire, and there
is no other process that could cause disk activity (no swapping
either!).

This happens when I transfer large files over NFS to the server (the
one with the 3ware).  I thought it was NFS, but nobody on -current
believed my description, and a buggy 3ware driver could be a possible
explanation for this very strange bug.  On the other hand, the system
was fully usable; only NFS locked up.

Thank you,

-Harry

>
> Thanks,
>
> Scott