Jeffrey Johnson
2010-Mar-02  22:41 UTC
[zfs-discuss] Problems with raidz2 resilvering with a TON of files
Hi Folks,

We have put together a 25T ZFS raidz2 zpool (16x2TB 5900 RPM 32MB Cache SATA 3.0Gb/s drives with 2x LSI SAS3081E-R SAS RAID Controllers presenting the drives as JBOD straight thru to the backplane) with 2 hot-spares on OpenSolaris snv_133. The pool contains roughly 800 Million files which are all very small (~10-200k map tiles). We had a hiccup with one of the drives and the resilvering process was initiated ... the problem is that zpool status is estimating something like 650 hours currently. This estimate has varied from 400 to 1800 as it has run over the last couple of days, but it seems to have settled around 650 now. That is just WAY too long ... we fear that if the end user of this device ever has to replace a drive in the pool, it will take this long to rebuild again.

So, we are wondering if a) there is some way we can optimize or tune the pool to deal with this number of small files better and speed up the resilvering process or b) some way we can tweak the resilvering code to handle for this type of situation better.

One of our engineers is looking at setting up a VM on another machine and using dtrace to find out where the bottleneck is, but we thought we might have more luck on this list.

Thanks,
Jeff
Bob Friesenhahn
2010-Mar-02  22:54 UTC
[zfs-discuss] Problems with raidz2 resilvering with a TON of files
On Tue, 2 Mar 2010, Jeffrey Johnson wrote:

> We have put together a 25T ZFS raidz2 zpool (16x2TB 5900 RPM 32MB
> Cache SATA 3.0Gb/s drives with 2x LSI SAS3081E-R SAS RAID Controllers
> presenting the drives as JBOD straight thru to the backplane) with 2
> hot-spares on OpenSolaris snv_133. The pool contains roughly 800
> Million files which are all very small (~10-200k map tiles). We had a
> hiccup with one of the drives and the resilvering process was
> initiated ... the problem is that zpool status is estimating something
> like 650 hours currently. This estimate has varied from 400 to 1800 as

Oh, dear! 16 slow drives in one raidz2 vdev is just plain too many! It should be perhaps half that (at most) per raidz2 vdev. With the super-huge drives you will want to dial down the number of drives per vdev. The slow seek times and long rotational delay are a killer.

> it has run over the last couple of days, but it seems to have settled
> around 650 now. That is just WAY too long ... we fear that if the end
> user of this device ever has to replace a drive in the pool, it will
> take this long to rebuild again.

This fear is well founded. Regardless, it is wise to use 'iostat -x 30' to see if you have a slow drive in the mix. The drives should be pretty uniformly loaded.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
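The "uniformly loaded" check can be sketched roughly as follows: take one service-time figure per drive and flag any drive far above the median of the group. The function name, the 2x factor, and the sample numbers are illustrative assumptions, not values from this thread or from iostat itself.

```python
from statistics import median


def find_slow_drives(svc_times_ms, factor=2.0):
    """Return the drives whose service time exceeds `factor` times the median.

    svc_times_ms maps a device name to its average service time in ms.
    The factor of 2.0 is an arbitrary illustrative cutoff, not a ZFS rule.
    """
    typical = median(svc_times_ms.values())
    return sorted(name for name, t in svc_times_ms.items() if t > factor * typical)


# Made-up sample values, not measurements from the pool discussed here.
sample = {"sd1": 12.1, "sd2": 11.8, "sd3": 48.5, "sd4": 12.6}
print(find_slow_drives(sample))
```

Comparing against the median rather than the mean keeps one very slow drive from dragging the baseline up and hiding itself.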
Richard Elling
2010-Mar-02  22:58 UTC
[zfs-discuss] Problems with raidz2 resilvering with a TON of files
On Mar 2, 2010, at 2:41 PM, Jeffrey Johnson wrote:

> Hi Folks,
>
> We have put together a 25T ZFS raidz2 zpool (16x2TB 5900 RPM 32MB
> Cache SATA 3.0Gb/s drives with 2x LSI SAS3081E-R SAS RAID Controllers
> presenting the drives as JBOD straight thru to the backplane) with 2
> hot-spares on OpenSolaris snv_133. The pool contains roughly 800
> Million files which are all very small (~10-200k map tiles). We had a
> hiccup with one of the drives and the resilvering process was
> initiated ... the problem is that zpool status is estimating something
> like 650 hours currently. This estimate has varied from 400 to 1800 as
> it has run over the last couple of days, but it seems to have settled
> around 650 now. That is just WAY too long ... we fear that if the end
> user of this device ever has to replace a drive in the pool, it will
> take this long to rebuild again.
>
> So, we are wondering if a) there is some way we can optimize or tune
> the pool to deal with this number of small files better and speed up
> the resilvering process or b) some way we can tweak the resilvering
> code to handle for this type of situation better.
>
> One of our engineers is looking at setting up a VM on another machine
> and using dtrace to find out where the bottleneck is, but we thought
> we might have more luck on this list.

Those are slow drives, so it will take a while to resilver. To verify the I/O bottleneck, use iostat and observe the svc_t. If it is more than 5ms or so, then just be patient.

AFAIK, there is no current rebuild characterization effort or data. I have/had data from several years ago, but it is not useful today.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
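As a rough sketch of the svc_t check, the snippet below pulls the svc_t column out of saved 'iostat -x' text and lists the devices above the 5 ms mark mentioned above. It assumes a header row that names both a 'device' and an 'svc_t' column. The sample report is made up for illustration and the helper names are hypothetical.

```python
def svc_times(iostat_text):
    """Map device name -> svc_t (ms) from 'iostat -x' style text.

    Assumes a header row containing 'device' and 'svc_t'; data rows are
    matched to the header by position. Later reports overwrite earlier ones.
    """
    result = {}
    columns = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if "svc_t" in fields:
            columns = fields  # header row: remember the column order
            continue
        if columns is None or len(fields) != len(columns):
            continue  # banner line, blank line, or anything unexpected
        row = dict(zip(columns, fields))
        result[row["device"]] = float(row["svc_t"])
    return result


def over_threshold(times, limit_ms=5.0):
    """Return only the devices whose svc_t is above limit_ms."""
    return {dev: t for dev, t in times.items() if t > limit_ms}


# Made-up sample report, not output captured from the pool in this thread.
SAMPLE = """\
                 extended device statistics
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
sd1      45.0    2.0  900.0   40.0  0.0  0.9   19.5   0  88
sd2       1.0    0.5   20.0   10.0  0.0  0.0    3.2   0   1
"""

print(over_threshold(svc_times(SAMPLE)))
```

Looking the column up by its header name rather than by a fixed position is a deliberate choice, since the column layout differs between iostat options and platforms.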
Erik Trimble
2010-Mar-02  23:26 UTC
[zfs-discuss] Problems with raidz2 resilvering with a TON of files
Richard Elling wrote:

> On Mar 2, 2010, at 2:41 PM, Jeffrey Johnson wrote:
>
>> Hi Folks,
>>
>> We have put together a 25T ZFS raidz2 zpool (16x2TB 5900 RPM 32MB
>> Cache SATA 3.0Gb/s drives with 2x LSI SAS3081E-R SAS RAID Controllers
>> presenting the drives as JBOD straight thru to the backplane) with 2
>> hot-spares on OpenSolaris snv_133. The pool contains roughly 800
>> Million files which are all very small (~10-200k map tiles). We had a
>> hiccup with one of the drives and the resilvering process was
>> initiated ... the problem is that zpool status is estimating something
>> like 650 hours currently. This estimate has varied from 400 to 1800 as
>> it has run over the last couple of days, but it seems to have settled
>> around 650 now. That is just WAY too long ... we fear that if the end
>> user of this device ever has to replace a drive in the pool, it will
>> take this long to rebuild again.
>>
>> So, we are wondering if a) there is some way we can optimize or tune
>> the pool to deal with this number of small files better and speed up
>> the resilvering process or b) some way we can tweak the resilvering
>> code to handle for this type of situation better.
>>
>> One of our engineers is looking at setting up a VM on another machine
>> and using dtrace to find out where the bottleneck is, but we thought
>> we might have more luck on this list.
>>
>
> Those are slow drives, so it will take a while to resilver. To verify the
> I/O bottleneck, use iostat and observe the svc_t. If it is more than 5ms
> or so, then just be patient.
>
> AFAIK, there is no current rebuild characterization effort or data. I have/had
> data from several years ago, but it is not useful today.
> -- richard

I'm still assuming that resilvering a disk under RAIDZ[123] which contains a large number of very small files is the Worst Case scenario for resilver rates, correct? Or has something significant changed recently?

That, and the 2TB/5900RPM drives are /horribly/ slow. They max out at 50 IOPS on a good day, and can't even do sustained streaming write/read above 100MB/s. I would be surprised if they can even make 10MB/s doing typical random I/O, and it's going to be even worse doing small random I/O chunks. Which boils down to me considering you lucky if you get 1-2 MB/s performance out of them.

Estimate of 650 hours to do 2TB = a bit under 1MB/s.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
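The arithmetic behind that last estimate can be checked with a short sketch. It assumes the whole 2 TB (decimal, 2e12 bytes) has to be rewritten. ZFS only resilvers allocated data, so a partly full drive would imply a lower rate. The function name is hypothetical.

```python
def implied_rate_mb_s(capacity_bytes, hours):
    """Average throughput in MB/s (10**6 bytes) needed to move
    capacity_bytes in the given number of hours."""
    return capacity_bytes / (hours * 3600) / 1e6


rate = implied_rate_mb_s(2e12, 650)
print(f"{rate:.2f} MB/s")  # prints "0.85 MB/s"

# At the quoted ~50 IOPS, that rate corresponds to roughly this many
# bytes moved per I/O operation (about 17 KB).
print(round(rate * 1e6 / 50))
```

The implied per-operation size falls inside the 10-200k tile range from the original post, which fits the picture of a resilver limited by seeks rather than by streaming bandwidth.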