I currently have an X4500 running S10U4 with the latest ZFS uber patch
(127729-07) for which "zpool scrub" is making very slow progress even
though the necessary resources are apparently available. It has now been
running for 3 days to reach 75% completion, yet in the last 12 hours it
advanced by only 3%. At times this server is busy running NFSD, and it is
understandable for the scrub to take a lower priority then, but I have
observed curiously long intervals when neither prstat nor iostat shows any
obvious bottleneck, e.g., disks at <10% busy. Is there a throttle on scrub
resource allocation that does not readily open up again after being
limited by other system activity?

For comparison, an identical system (same OS/zpool config, and roughly the
same number of filesystems and files) finished a scrub in 2 days. This is
not a critical problem, but at least initially it was clear from iostat
that the scrub was using all the available disk IOPS/bandwidth, and I am
curious why it has backed off from that after a few days of running.

P.S. I realize it is not a user command and that the last scrub event can
be found in "zpool status", but I would find it convenient if the scrub
completion event were also logged in the zpool history along with the
initiation event.

Thanks.

--
Stuart Anderson
anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
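As a quick sanity check that the scrub is being starved rather than merely
throttled, it helps to look at when the scrub was started and what the pool
disks are doing at the same moment. The commands below are a minimal sketch
(the pool name "tank" and the 600-second iostat interval are placeholders,
not taken from the system above):

# zpool history tank | grep scrub
# zpool status tank | grep "in progress"
# iostat -xn 600 2

The first iostat report summarizes activity since boot; the second covers
the live 600-second window, so the per-disk %b and asvc_t columns there
show whether the drives really are idle while the scrub crawls.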
On Thu, Mar 06, 2008 at 11:51:00AM -0800, Stuart Anderson wrote:
> I currently have an X4500 running S10U4 with the latest ZFS uber patch
> (127729-07) for which "zpool scrub" is making very slow progress even
> though the necessary resources are apparently available. Currently it has

It is also interesting to note that this system is now making negative
progress. I can understand the remaining time estimate going up with time,
but what does it mean for the % complete number to go down after 6 hours
of work?

Thanks.

# zpool status | egrep -e "progress|errors" ; date
 scrub: scrub in progress, 75.49% done, 28h51m to go
errors: No known data errors
Thu Mar 6 08:50:59 PST 2008

# zpool status | egrep -e "progress|errors" ; date
 scrub: scrub in progress, 75.24% done, 31h20m to go
errors: No known data errors
Thu Mar 6 15:15:39 PST 2008

--
Stuart Anderson
anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
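To see whether the percentage drifts backwards steadily or jumps after
specific events, one could log it on a timer. The loop below is a rough
sketch (the pool name "tank", the 30-minute interval, and the log path are
assumptions):

#!/bin/sh
# Append a timestamped scrub-progress sample every 30 minutes.
while :; do
    date >> /var/tmp/scrub-progress.log
    zpool status tank | grep "scrub in progress" >> /var/tmp/scrub-progress.log
    sleep 1800
done

Note that the bug IDs mentioned later in this thread suggest "zpool status"
itself can restart a scrub in some circumstances, so a sampling loop like
this is best run as root and with a long interval.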
anderson at ligo.caltech.edu said:
> It is also interesting to note that this system is now making negative
> progress. I can understand the remaining time estimate going up with time,
> but what does it mean for the % complete number to go down after 6 hours
> of work?

Sorry I don't have any helpful experience in this area. It occurs to me
that perhaps you are detecting a gravity wave of some sort -- Thumpers are
pretty heavy, and thus may be more affected than the average server. Or
the guys at SLAC have, unbeknownst to you, somehow accelerated your
Thumper to near the speed of light.

(:-)

Regards,

Marion
On Thu, Mar 06, 2008 at 05:55:53PM -0800, Marion Hakanson wrote:
> Sorry I don't have any helpful experience in this area. It occurs to me
> that perhaps you are detecting a gravity wave of some sort -- Thumpers
> are pretty heavy, and thus may be more affected than the average server.
> Or the guys at SLAC have, unbeknownst to you, somehow accelerated your
> Thumper to near the speed of light.
>
> (:-)
>

If true, that would certainly help, since we actually are using these
Thumpers to help detect gravitational waves! See
http://www.ligo.caltech.edu.

Thanks.

--
Stuart Anderson
anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
Stuart Anderson wrote:
> On Thu, Mar 06, 2008 at 11:51:00AM -0800, Stuart Anderson wrote:
>
>> I currently have an X4500 running S10U4 with the latest ZFS uber patch
>> (127729-07) for which "zpool scrub" is making very slow progress even
>> though the necessary resources are apparently available. Currently it has
>>
>
> It is also interesting to note that this system is now making negative
> progress. I can understand the remaining time estimate going up with
> time, but what does it mean for the % complete number to go down after
> 6 hours of work?
>
> Thanks.
>
> # zpool status | egrep -e "progress|errors" ; date
>  scrub: scrub in progress, 75.49% done, 28h51m to go
> errors: No known data errors
> Thu Mar 6 08:50:59 PST 2008
>
> # zpool status | egrep -e "progress|errors" ; date
>  scrub: scrub in progress, 75.24% done, 31h20m to go
> errors: No known data errors
> Thu Mar 6 15:15:39 PST 2008
>

There are a few things which may cause the scrub to restart. See:

6655927 zpool status causes a resilver or scrub to restart
6343667 scrub/resilver has to start over when a snapshot is taken

Sorry the latter doesn't have a useful description, but the synopsis says
it all: taking snapshots causes scrubs to restart. Either of these may
explain the "negative progress."

-- David Pacheco, Sun Microsystems
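Since 6343667 makes snapshot creation restart an in-flight scrub, one quick
check is to compare snapshot creation times against the time the scrub was
kicked off. A minimal sketch, assuming a pool named "tank":

# zpool history tank | grep scrub
# zfs list -r -t snapshot -o name,creation -s creation tank

If any snapshot's creation time falls after the "zpool scrub" entry in the
history, the restart behaviour described in 6343667 would explain the
percentage going down.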
On Thu, Mar 06, 2008 at 06:25:21PM -0800, David Pacheco wrote:
> Stuart Anderson wrote:
> >
> > It is also interesting to note that this system is now making negative
> > progress. I can understand the remaining time estimate going up with
> > time, but what does it mean for the % complete number to go down after
> > 6 hours of work?
> >
> > Thanks.
> >
> > # zpool status | egrep -e "progress|errors" ; date
> >  scrub: scrub in progress, 75.49% done, 28h51m to go
> > errors: No known data errors
> > Thu Mar 6 08:50:59 PST 2008
> >
> > # zpool status | egrep -e "progress|errors" ; date
> >  scrub: scrub in progress, 75.24% done, 31h20m to go
> > errors: No known data errors
> > Thu Mar 6 15:15:39 PST 2008
> >
>
> There are a few things which may cause the scrub to restart. See:
>
> 6655927 zpool status causes a resilver or scrub to restart
> 6343667 scrub/resilver has to start over when a snapshot is taken
>
> Sorry the latter doesn't have a useful description, but the synopsis
> says it all: taking snapshots causes scrubs to restart. Either of these
> may explain the "negative progress."
>

I have confirmed that "zpool status" run as root does not reset the
counter on this system, and no snapshots were made since the scrub was
started.

For what it is worth, this system has now made it to:

# /usr/sbin/zpool status | egrep -e "progress|errors" ; date
 scrub: scrub in progress, 92.05% done, 10h4m to go
errors: No known data errors
Fri Mar 7 12:41:44 PST 2008

The kernel is at 1% CPU utilization and the disks in the pool are only 10%
busy (iostat). So it is making progress, but I am still confused about
what resource is blocking faster progress.

Thanks.

--
Stuart Anderson
anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
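With aggregate CPU and disk utilization this low, one remaining thing worth
checking is whether a single device is responding much more slowly than its
peers: a scrub keeps only a limited number of I/Os in flight, so it can end
up pacing itself to the slowest disk even while the others look idle. A
simple sketch (the interval and count are arbitrary):

# iostat -xn 30 4

Look for one disk whose asvc_t or actv column sits consistently far above
the others while %b stays low everywhere else.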