I have a cluster with 8 nodes, all of them running Debian Lenny (plus some
additions so multipath and Infiniband work), which share an array of 48 1TB
disks. Those disks form 22 pairs of hardware RAID1, plus 4 spares. The first
21 pairs are organized into two striped LVM logical volumes, of 16 and 3 TB,
both formatted with ocfs2. The kernel is the version supplied with the
distribution (2.6.26-2-amd64).

I wanted to run an fsck on both volumes because of some errors I was getting
(probably unrelated to the filesystems, but I wanted to check). On the 3TB
volume (around 10% full) the check worked perfectly and finished in less than
an hour (this was run with the fsck.ocfs2 provided by Lenny's ocfs2-tools,
version 1.4.1):

=============
root@hidra0:/usr/local/src# fsck.ocfs2 -f /dev/hidrahome/lvol1
Checking OCFS2 filesystem in /dev/hidrahome/lvol1:
label: <NONE>
uuid: ab 76 a9 41 fa df 4c ac a3 9f 26 c5 ae 34 1a 3f
number of blocks: 959809536
bytes per block: 4096
number of clusters: 959809536
bytes per cluster: 4096
max slots: 8

/dev/hidrahome/lvol1 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.
===========

but the check for the second filesystem (around 40% full) did this:

===========
hidra0:/usr/local/src# fsck.ocfs2 -f /dev/hidrahome/lvol0
Checking OCFS2 filesystem in /dev/hidrahome/lvol0:
label: <NONE>
uuid: 6a a9 0e aa cf 33 45 4c b4 72 3a b6 7c 3b 8d 57
number of blocks: 4168098816
bytes per block: 4096
number of clusters: 4168098816
bytes per cluster: 4096
max slots: 8

/dev/hidrahome/lvol0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
============

and stayed there for 8 hours (all the time keeping one core around 100% CPU
usage and with a light load on the disks; this was consistent with the same
step in the previous run, though of course that one didn't take nearly as
long). I thought that maybe I had run into some bug, so I interrupted the
process, downloaded the ocfs2-tools 1.4.4 sources, compiled them, and tried
with that fsck, obtaining similar results: it's been running for almost 7
hours like this:

============
hidra0:/usr/local/src/ocfs2-tools-1.4.4/fsck.ocfs2# ./fsck.ocfs2 -f /dev/hidrahome/lvol0
fsck.ocfs2 1.4.4
Checking OCFS2 filesystem in /dev/hidrahome/lvol0:
Label: <NONE>
UUID: 6AA90EAACF33454CB4723AB67C3B8D57
Number of blocks: 4168098816
Block size: 4096
Number of clusters: 4168098816
Cluster size: 4096
Number of slots: 8

/dev/hidrahome/lvol0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
============

again with one core at 100% CPU.

Could someone tell me if this is normal? I've been searching the web and
checking manuals for information on how long these checks should take, and
apart from one message on this list mentioning that 3 days for an 8 TB
filesystem with 300 GB used was too long, I haven't been able to find
anything.

If this is normal, is there any way to estimate, taking into account that the
first filesystem uses exactly the same disks and took less than an hour to
check, how long it should take for this other filesystem?

Thanks!

Josep Guerrero
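A quick back-of-the-envelope check on the two fsck headers above (this is just
shell arithmetic on the reported figures, assuming the 4096-byte cluster size
they show): the cluster counts work out to roughly 15 TiB for lvol0 and 3 TiB
for lvol1, consistent with the stated 16 and 3 TB, which means pass 0a on the
larger volume has over four billion clusters' worth of allocator chains to
walk.

echo "$(( 4168098816 * 4096 / 1024 ** 4 )) TiB"   # lvol0: prints 15
echo "$((  959809536 * 4096 / 1024 ** 4 )) TiB"   # lvol1: prints 3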
What is the block size?

-----Original Message-----
From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Josep Guerrero
Sent: Thursday, April 21, 2011 4:43 PM
To: ocfs2-users at oss.oracle.com
Subject: [Ocfs2-users] How long for an fsck?
On 04/21/2011 06:43 AM, Josep Guerrero wrote:
> I have a cluster with 8 nodes, all of them running Debian Lenny (plus some
> additions so multipath and Infiniband work), which share an array of 48 1TB
> disks. Those disks form 22 pairs of hardware RAID1, plus 4 spares. The first
> 21 pairs are organized into two striped LVM logical volumes, of 16 and 3 TB,
> both formatted with ocfs2. The kernel is the version supplied with the
> distribution (2.6.26-2-amd64).
<snip>
> but the check for the second filesystem (around 40% full) did this:
>
> ===========
> hidra0:/usr/local/src# fsck.ocfs2 -f /dev/hidrahome/lvol0
> Checking OCFS2 filesystem in /dev/hidrahome/lvol0:
> label: <NONE>
> uuid: 6a a9 0e aa cf 33 45 4c b4 72 3a b6 7c 3b 8d 57
> number of blocks: 4168098816
> bytes per block: 4096
> number of clusters: 4168098816
> bytes per cluster: 4096
> max slots: 8
>
> /dev/hidrahome/lvol0 was run with -f, check forced.
> Pass 0a: Checking cluster allocation chains
> ============
>
> and stayed there for 8 hours (all the time keeping one core around 100% CPU
> usage and with a light load on the disks; this was consistent with the same
> step in the previous run, though of course that one didn't take nearly as
> long).
<snip>
> If this is normal, is there any way to estimate, taking into account that the
> first filesystem uses exactly the same disks and took less than an hour to
> check, how long it should take for this other filesystem?

Do:

# debugfs.ocfs2 -R "stat //global_bitmap" /dev/hidrahome/lvol0

Does this hang too? Redirect the output to a file. That will give us some
clues.
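Concretely, capturing the suggested command's output to a file might look like
this (the file name is just an example, not from the thread):

# debugfs.ocfs2 -R "stat //global_bitmap" /dev/hidrahome/lvol0 > /tmp/global_bitmap.txt 2>&1
# bzip2 /tmp/global_bitmap.txt

On a volume with this many clusters the stat dump of //global_bitmap can run
to several megabytes, so compressing it before attaching it to a mail is
worthwhile.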
Hello again,

It just finished. The output file is almost 9 MB long, but compressed it is
less than 1 MB. I attach it to this message.

> Do:
> # debugfs.ocfs2 -R "stat //global_bitmap" /dev/hidrahome/lvol0
>
> Does this hang too? Redirect the output to a file. That will give us some
> clues.

Josep Guerrero

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fit.bz2
Type: application/x-bzip
Size: 903876 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20110421/cc82c559/attachment-0001.bin
On 04/21/2011 10:46 AM, Josep Guerrero wrote:
> Hello again,
>
> It just finished. The output file is almost 9 MB long, but compressed it is
> less than 1 MB. I attach it to this message.
>
>> Do:
>> # debugfs.ocfs2 -R "stat //global_bitmap" /dev/hidrahome/lvol0
>>
>> Does this hang too? Redirect the output to a file. That will give us some
>> clues.

How long did the debugfs output take?
Did fsck eventually finish?
If so, how long did that take? Approximately.

I have a theory as to why it is slow. But I would like some confirmation.

Thanks
Sunil
On 04/22/2011 02:33 PM, Sunil Mushran wrote:
> On 04/21/2011 10:46 AM, Josep Guerrero wrote:
>> Hello again,
>>
>> It just finished. The output file is almost 9 MB long, but compressed it is
>> less than 1 MB. I attach it to this message.
>>
>>> Do:
>>> # debugfs.ocfs2 -R "stat //global_bitmap" /dev/hidrahome/lvol0
>>>
>>> Does this hang too? Redirect the output to a file. That will give us some
>>> clues.
>
> How long did the debugfs output take?
> Did fsck eventually finish?
> If so, how long did that take? Approximately.
>
> I have a theory as to why it is slow. But I would like some confirmation.

BTW, you said one of the cores was at 100%. What does top show?
Is fsck the main contributor or is some other process spinning?

My theory had fsck at a high wait%. I seem to be missing something.
On 04/22/2011 03:24 PM, Josep Guerrero wrote:
>> How long did the debugfs output take?
> I think about 30 minutes. No more than 50 for sure (just going by the times
> of the mails).
>
>> Did fsck eventually finish?
> No. I had to cancel it after it stayed 24 hours in the same state, showing
> the same message. It never moved beyond "Pass 0a", and was always using 100%
> CPU on one core. I don't know if it would have finished on its own.
>
>> BTW, you said one of the cores was at 100%. What does top show?
>> Is fsck the main contributor or is some other process spinning?
> It was fsck (I kept top open the whole time, and fsck was always at around
> 99% CPU usage).
>
>> I have a theory as to why it is slow. But I would like some confirmation.
>> My theory had fsck at a high wait%. I seem to be missing something.
> I didn't look at the wait%, but I checked the physical disk load with iotop
> and it was very low, so it didn't look like fsck was being slow because of
> the disk. On the filesystem I successfully "fscked" before (the 3 TB one that
> took less than 60 minutes), it started out doing something similar (very high
> CPU usage, low disk load), but after several minutes (when the rest of the
> messages after "Pass 0a" appeared) it did just the opposite: low CPU use,
> high disk load. Both filesystems are physically on the same set of disks (the
> 16TB logical volume is a striped LVM volume that fills about 75% of the 21
> physical disks, and the 3TB one is another striped LVM volume filling the
> remaining space on the same disks), so I don't think it's a problem with the
> physical devices (of course, I could be wrong).

File a bz. This will need some investigation.

BTW, how much memory does your box have?
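For anyone seeing the same symptom, a generic way to confirm whether
fsck.ocfs2 is burning CPU or waiting on the disks, and to answer the memory
question, is sketched below. It assumes a single fsck.ocfs2 process and a
sysstat package recent enough to provide pidstat -d; otherwise top's %wa
column together with iotop, which Josep already used, tells the same story.

pidstat -u -d -p "$(pidof fsck.ocfs2)" 5   # per-process CPU and I/O, sampled every 5 seconds
iostat -x 5                                # system-wide %iowait and per-device utilisation
free -m                                    # total memory and swap, to rule out swapping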
Hi,

On Sat, Apr 23, 2011 at 10:57 AM, Sunil Mushran <sunil.mushran at oracle.com> wrote:
> On 04/23/2011 07:56 AM, Tao Ma wrote:
>> So what is your version of fsck? I have met with some issues like that
>> when fsck is allocating a large amount of memory and it gets stuck for
>> quite a long time because of the swapping.
>
> It is not that issue. It is in pass0. I assumed there was a problem
> in the cluster allocation chains. But debugfs managed to scan the
> chain. No loops. Looks ok. So unsure where it could be spinning.
>
> Note it is a 16T, 4k/4k fs.

We had a similar problem which was fixed by
commit 2d741da9367b33f559802dfabe62d96f6adc7777

The version number would be helpful.

Regards,

--
Goldwyn
On 05/11/2011 11:14 AM, Goldwyn Rodrigues wrote:
> Hi,
>
> On Sat, Apr 23, 2011 at 10:57 AM, Sunil Mushran
> <sunil.mushran at oracle.com> wrote:
>> On 04/23/2011 07:56 AM, Tao Ma wrote:
>>> So what is your version of fsck? I have met with some issues like that
>>> when fsck is allocating a large amount of memory and it gets stuck for
>>> quite a long time because of the swapping.
>> It is not that issue. It is in pass0. I assumed there was a problem
>> in the cluster allocation chains. But debugfs managed to scan the
>> chain. No loops. Looks ok. So unsure where it could be spinning.
>>
>> Note it is a 16T, 4k/4k fs.
>
> We had a similar problem which was fixed by
> commit 2d741da9367b33f559802dfabe62d96f6adc7777
>
> The version number would be helpful.

Thanks for that. Josep was on 1.4.4. Fixed in 1.6.4.

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1323
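For anyone hitting this on a similar setup: on Debian it is easy to confirm
which ocfs2-tools build is in use, and since Lenny's packaged 1.4.1 and the
locally built 1.4.4 both predate the fix above, picking up the fixed pass 0
most likely means building 1.6.4 from source, the same way Josep built 1.4.4.

dpkg -l ocfs2-tools            # version of the packaged tools (1.4.1 on Lenny)
apt-cache policy ocfs2-tools   # what the configured repositories offer
fsck.ocfs2 -V                  # assumed version flag; the banner printed at startup shows the version either way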