We recently moved a MySQL database from NFS (NetApp) to a local disk array (a J4200 with SAS disks). Shortly after moving production, the system effectively hung: CPU was at 100%, and one disk drive was at 100%. I had mostly followed the tuning recommendations for MySQL:

* recordsize set to 16K
* primarycache=metadata
* zfs_prefetch_disable=1

The theory behind primarycache=metadata is that MySQL will do a better job of caching internally than ZFS does. However, the continuous reads from one disk suggested to me that perhaps MySQL was reading something repeatedly. So (after restarting everything) I set primarycache back to the default. I haven't seen the problem again, but there's no way to know whether I actually fixed it or whether it was just a fluke. At this point load on the storage is low enough that further tuning doesn't seem worth it; we average less than 1 MB/sec read and write.

-- 
This message posted from opensolaris.org
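For reference, the three tunables above would have been applied roughly as follows on Solaris/OpenSolaris. This is a sketch, not the poster's exact commands; the dataset name `tank/mysql` is hypothetical.

```shell
#!/bin/sh
# Per-dataset MySQL tuning (dataset name "tank/mysql" is an example):

# InnoDB uses 16K pages, so match the ZFS recordsize to avoid
# read-modify-write amplification:
zfs set recordsize=16K tank/mysql

# Cache only metadata in the ARC, on the theory that InnoDB's own
# buffer pool caches data better than ZFS can:
zfs set primarycache=metadata tank/mysql

# Prefetch is disabled system-wide via /etc/system (takes effect
# after a reboot):
#   set zfs:zfs_prefetch_disable=1

# To revert primarycache to the default ("all"), as the poster did:
zfs inherit primarycache tank/mysql
```

Note that `primarycache=metadata` means every data read that misses the InnoDB buffer pool goes to disk, which is consistent with the repeated-read symptom described above.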
+------------------------------------------------------------------------------
| On 2010-02-20 08:12:53, Charles Hedrick wrote:
|
| We recently moved a MySQL database from NFS (NetApp) to a local disk
| array (J4200 with SAS disks). Shortly after moving production, the
| system effectively hung. CPU was at 100%, and one disk drive was at
| 100%.

If one disk is stuck at 100% busy, constantly, it usually means that disk is dying, dead, or there's a problem with its backplane. Check soft/hard/transport errors with iostat. Check fmdump. See zpool status -v for checksum errors.

Did you test the disks before deploying? A 12-way mirror and filebench/bonnie++/iozone is a nice way to quickly stress them.

A hung disk can affect system performance in extremely obnoxious ways. Offline the disk and see if performance improves. Depending on the driver this can take many minutes. (It may be faster to just pull the disk and let the kernel notice it's actually really gone, instead of just maybe not talking anymore.)

-- 
bda
cyberpunk is dead. long live cyberpunk.
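The checks suggested above map to the following Solaris commands. A sketch only; the pool name `tank` and device name `c1t5d0` are placeholders for whatever `zpool status` actually reports.

```shell
#!/bin/sh
# Per-device error counters: the "Soft Errors", "Hard Errors", and
# "Transport Errors" lines are the sw/hw/trn counts mentioned above:
iostat -En

# Fault Management log of diagnosed faults; add -eV for the raw
# error telemetry (per-I/O error events):
fmdump
fmdump -eV

# Pool health, per-vdev read/write/checksum error counts, and any
# files with unrecoverable errors:
zpool status -v tank

# Take the suspect disk offline and watch whether throughput
# recovers (device name is an example):
zpool offline tank c1t5d0

# Bring it back once testing is done:
zpool online tank c1t5d0
```

Offlining is only possible if the pool has enough redundancy to survive without that device, so check `zpool status` first.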
We had been using the same pool for a backup MySQL server for six months before using it for the primary server. Neither zpool status -v nor fmdump shows any sign of problems.
I hadn't considered stress-testing the disks. Obviously that's a good idea. We'll look at doing something in May, when we have the next opportunity to take down the database. I doubt that doing testing during production is a good idea...
+------------------------------------------------------------------------------
| On 2010-02-20 08:45:23, Charles Hedrick wrote:
|
| I hadn't considered stress-testing the disks. Obviously that's a good
| idea. We'll look at doing something in May, when we have the next
| opportunity to take down the database. I doubt that doing testing
| during production is a good idea...

Indeed. :)

I'd again suggest offlining/pulling the 100% disk to see if that helps.

-- 
bda
cyberpunk is dead. long live cyberpunk.