Patrick M. Hausen wrote:
> Hi, all!
>
> We just set a brand new Intel SSR212CC box into production.
> This is basically a standard server with 2 LSI SATA RAID
> controllers and 12 drive bays in 2 rack units height.
>
> Intel sells it as a storage product. There's a variant
> of Windows 2003 server that turns this box into an iSCSI
> target.
>
> We want to use it for disk based backup with Amanda.
> The system runs 6-STABLE at the moment.
>
> amr0: <LSILogic MegaRAID 1.53> mem 0xfbef0000-0xfbefffff,
> 0xfcd00000-0xfcdfffff irq 72 at device 14.0 on pci6
> amr0: <LSILogic Intel(R) RAID Controller SRCS28X>
> Firmware 814C, BIOS H431, 128MB RAM
> amr1: <LSILogic MegaRAID 1.53> mem 0xfbff0000-0xfbffffff,
> 0xfcf00000-0xfcffffff irq 96 at device 14.0 on pci8
> amr1: <LSILogic Intel(R) RAID Controller SRCS28X>
> Firmware 814C, BIOS H431, 128MB RAM
> amrd0: <LSILogic MegaRAID logical drive> on amr0
> amrd0: 1907348MB (3906248704 sectors) RAID 5 (optimal)
>
> Since the two RAID controllers come with a battery backup for
> their cache memory, I configured the logical drive with write
> back cache policy and the individual disk drives' write caches
> off.
>
> After cvsup and build/installworld, I noticed strange
> Sendmail failures (signal 11) on the box.
>
> Reinstalling Sendmail fixed the problem. Just to make sure
> I did installworld again, rebooted - Sendmail signal 11.
>
> Then it dawned on me that Sendmail is the last binary installed
> and written to the logical drive in the installworld process.
> I can reproduce the problem any time: installworld, reboot,
> Sendmail broken. Installworld or just reinstall Sendmail, don't
> reboot, everything's fine. No matter if I use "reboot" or
> "shutdown -r".
>
> Is it possible that the amr driver does not issue the necessary
> flush command to the controller (probably first part of the
> problem) and additionally the controller loses its cache
> content at the following system reset despite its BBU
> (second part of problem - iir controllers by ICP Vortex handle
> a system reset just fine, syncing the drives during boot)?
>
> Any ideas? I don't have a different explanation. A coworker
> suggested a possible yet unknown UFS2 problem with large
> filesystems, but /usr is not large on this box. /var is.
>
> The last couple of writes before a system reboot are lost.
> Reliably. I will set the controller's cache policy back to
> "write through", but I'm still not sleeping well ...
>
>
> Thanks,
> Patrick
>
> P.S. As a side note: no problems at all with the em(4) driver so
> far on this one.

It is very arguably a bug in the LSI firmware if it is actually dumping
its cache when a PCI reset occurs, especially if a battery unit is
present. However, I seriously doubt that you will get anyone at LSI to
listen to this problem. Do you get any messages on the console at
shutdown about the amr driver flushing the cache? Also, check the cache
setting on the drives themselves. Maybe the drives are losing power or
getting reset while data is in their cache. It's bad practice to enable
the write cache on a drive in a RAID array for just this very reason,
but some vendors do it anyway in an attempt to cover up poor
performance.
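
To pin down exactly which writes go missing across the reboot, something
along these lines might help. It is only a rough sketch, not anything
specific to amr(4) or the LSI firmware: it records SHA-256 checksums of a
set of files before the reboot and compares them afterwards. The function
names and the manifest format are made up for illustration. The manifest
should live somewhere that does not sit behind the suspect controller
(another host, for example), otherwise it may be lost along with
everything else.

```python
import hashlib
import json
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large binaries do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record(paths, manifest):
    """Write a JSON manifest mapping each path to its checksum.

    The manifest is flushed and fsync'ed, but that only helps if it is
    stored on a device that is not affected by the problem under test.
    """
    sums = {p: sha256_of(p) for p in paths}
    with open(manifest, "w") as f:
        json.dump(sums, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    return sums


def verify(manifest):
    """Return the paths whose current checksum differs from the manifest.

    A file that is missing or unreadable is reported as a mismatch too.
    """
    with open(manifest) as f:
        sums = json.load(f)
    bad = []
    for path, expected in sums.items():
        try:
            actual = sha256_of(path)
        except OSError:
            actual = None
        if actual != expected:
            bad.append(path)
    return bad
```

Right after installworld one would call record() on the binaries installed
last (the Sendmail binary in this case), reboot, and then call verify() on
the same manifest. Any path it returns was either not written out or was
written incompletely.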
Scott