thr3ads.net - freebsd stable - newfs locks entire machine for 20seconds [Jan 2008]

Steven Hartland

2008-Jan-30 13:54 UTC

newfs locks entire machine for 20seconds

----- Original Message ----- 
From: "Ivan Voras" <ivoras@freebsd.org>

>> The machine is running with ULE on 7.0 as mentioned, using an Areca 1220
>> controller over 8 disks in RAID 6 + Hotspare.
> 
> I'd suggest you first try to reproduce the stall without ULE, while
> keeping all other parameters exactly the same.

OK, I tried with an updated 7 world/kernel as of this afternoon and with 4BSD
instead of ULE, and it made no difference: the machine still locks up, with no
activity, for anywhere from 20 to 30 seconds.

Here's a snapshot from top in CPU and I/O modes from when the stall occurred:
[top]
last pid:  1102;  load averages:  0.02,  0.08,  0.07    up 0+00:09:37  21:39:13
162 processes: 4 running, 145 sleeping, 13 waiting
CPU states:  0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.6% idle
Mem: 60M Active, 19M Inact, 54M Wired, 56K Cache, 27M Buf, 3809M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
   12 root        1 171 ki31     0K    16K RUN    0   8:59 97.90% idle: cpu0
   11 root        1 171 ki31     0K    16K RUN    1   8:57 95.80% idle: cpu1
 1102 root        1  -8    0  4752K  1256K physrd 1   0:01 19.64% newfs
    4 root        1  -8    -     0K    16K -      0   0:00  0.10% g_down
 1048 root        1  96    0  7656K  2544K CPU0   0   0:01  0.00% top
 1054 root        1  96    0  7656K  2348K CPU1   1   0:01  0.00% top
  863 root        1  96    0   131M 15768K select 0   0:00  0.00% httpd
 1055 root        1  96    0 32928K  4656K select 0   0:00  0.00% sshd


last pid:  1102;  load averages:  0.02,  0.08,  0.07    up 0+00:09:37  21:39:13
162 processes: 4 running, 145 sleeping, 13 waiting
CPU states:  0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.6% idle
Mem: 60M Active, 19M Inact, 54M Wired, 56K Cache, 27M Buf, 3809M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME   VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
   12 root          9    154      0      0      0      0   0.00% idle: cpu0
   11 root         28      5      0      0      0      0   0.00% idle: cpu1
 1102 root          5      0      0      0      0      0   0.00% newfs
    4 root         14      0      0      0      0      0   0.00% g_down
 1048 root          1      0      0      0      0      0   0.00% top
 1054 root          1      0      0      0      0      0   0.00% top
  863 root          1      0      0      0      0      0   0.00% httpd
[/top]

===============================================
This e.mail is private and
confidential between Multiplay (UK) Ltd. and the person or entity to whom it is
addressed. In the event of misdirection, the recipient is prohibited from using,
copying, printing or otherwise disseminating it or any information contained in
it.

In the event of misdirection, illegible or incomplete transmission please
telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.

Dieter

2008-Jan-30 15:54 UTC

newfs locks entire machine for 20seconds

In message <008201c86388$fd159010$b6db87d4@multiplay.co.uk>, "Steven
Hartland" writes:
> From: "Ivan Voras" <ivoras@freebsd.org>
> >> The machine is running with ULE on 7.0 as mention using an Areca 1220
> >> controller over 8 disks in RAID 6 + Hotspare.
> > 
> > I'd suggest you first try to reproduce the stall without ULE, while
> > keeping all other parameters exactly the same.
> 
> Ok tried with an updated 7 world / kernel as of this afternoon and with 4BSD
> instead of ULE and no difference the machine still locks up with no activity
> for anywhere from 20 to 30 seconds.
> 
What *exactly* do you mean by

> machine still locks up with no activity for anywhere from 20 to 30 seconds.

Is there disk activity (e.g. activity lights flashing, if you have them)?

Does top continue to update the screen during the 20-30 seconds?

I'm thinking that newfs has queued up a bunch of disk i/o, and other
disk i/o gets locked out, but activities that don't require any disk i/o
(like top, once it is up and running) could continue.  Is that what is
happening?

Steven Hartland

2008-Jan-30 17:27 UTC

newfs + gstat locks entire machine for 20seconds

The plot thickens... This stall is not caused by newfs alone: gstat has to be
running as well. If I run newfs without gstat running, no stall occurs. As soon
as I run gstat while newfs is working, everything locks up as described.

Running truss on gstat shows, I believe, the cause of the issue, but I don't
know what it means:
[truss -o t.txt -p 61629 -d]
9.008933817 nanosleep({1.000000000})         = 0 (0x0)
9.008969017 gettimeofday({1201742426.147393},0x0) = 0 (0x0)
9.009009804 poll({0/POLLIN},1,0)         = 0 (0x0)
9.009040534 gettimeofday({1201742426.147465},0x0) = 0 (0x0)
9.009076852 clock_gettime(0,{1201742426.147501706}) = 0 (0x0)
9.009294477 sigaction(SIGTSTP,{ SIG_IGN SA_RESTART ss_t },{ 0x800cb2470 SA_RESTART ss_t }) = 0 (0x0)
9.009335823 poll({0/POLLIN},1,0)         = 0 (0x0)
9.009387785 poll({0/POLLIN},1,0)         = 0 (0x0)
9.009457626 write(1,"\^[[4;11H 5\^[[6C2     32  467.8"...,213) = 213 (0xd5)
9.009488636 sigaction(SIGTSTP,{ 0x800cb2470 SA_RESTART ss_t },0x0) = 0 (0x0)
10.009930312 nanosleep({1.000000000})        = 0 (0x0)
10.009963836 gettimeofday({1201742427.148388},0x0) = 0 (0x0)
10.010005182 poll({0/POLLIN},1,0)        = 0 (0x0)
10.010036192 gettimeofday({1201742427.148461},0x0) = 0 (0x0)
10.010073068 clock_gettime(0,{1201742427.148497922}) = 0 (0x0)
10.010292369 mmap(0x801000000,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34376515584 (0x801000000)
10.010327569 __sysctl(0x7fffffffe6c0,0x2,0x7fffffffe650,0x7fffffffe6b8,0x800844970,0x11) = 0 (0x0)
25.052947791 __sysctl(0x7fffffffe650,0x3,0x801000000,0x7fffffffe720,0x0,0x0) = 0 (0x0)
25.054030610 munmap(0x801000000,1048576)     = 0 (0x0)
25.055022356 sigaction(SIGTSTP,{ SIG_IGN SA_RESTART ss_t },{ 0x800cb2470 SA_RESTART ss_t }) = 0 (0x0)
25.055067892 poll({0/POLLIN},1,0)        = 0 (0x0)
25.055130470 poll({0/POLLIN},1,0)        = 0 (0x0)
25.055230203 write(1,"\^[[4;11H1\^[[7C4     64  203.4"...,203) = 203 (0xcb)
25.055263448 sigaction(SIGTSTP,{ 0x800cb2470 SA_RESTART ss_t },0x0) = 0 (0x0)
26.055866597 nanosleep({1.000000000})        = 0 (0x0)
26.055900400 gettimeofday({1201742443.194324},0x0) = 0 (0x0)
26.055940070 poll({0/POLLIN},1,0)        = 0 (0x0)
26.055969962 gettimeofday({1201742443.194394},0x0) = 0 (0x0)
26.056009073 clock_gettime(0,{1201742443.194433649}) = 0 (0x0)
26.056240388 sigaction(SIGTSTP,{ SIG_IGN SA_RESTART ss_t },{ 0x800cb2470 SA_RESTART ss_t }) = 0 (0x0)
26.056280896 poll({0/POLLIN},1,0)        = 0 (0x0)
26.056334534 poll({0/POLLIN},1,0)        = 0 (0x0)
26.056420299 poll({0/POLLIN},1,0)        = 0 (0x0)
26.056485112 write(1,"\^[[1;6H6.046s  w: 1.000s\^[[4;5"...,305) = 305 (0x131)
26.056516121 sigaction(SIGTSTP,{ 0x800cb2470 SA_RESTART ss_t },0x0) = 0 (0x0)
27.056863372 nanosleep({1.000000000})        = 0 (0x0)
[/truss -o t.txt -p 61629 -d]
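
The stall shows up in the trace as a jump in the elapsed-time column: the stamps go from 10.010 to 25.052 between the two __sysctl lines, which suggests the second call took roughly 15 seconds to return. In a longer trace such a jump is easy to miss, so a small filter may help. The following is only a sketch and not something posted in this thread: find_truss_gaps is a made-up helper name, and it assumes every trace line of interest starts with the seconds.nanoseconds stamp that truss -d writes.

```shell
#!/bin/sh
# find_truss_gaps LIMIT < trace
# (hypothetical helper) Print every timestamped line that comes more than
# LIMIT seconds after the previous timestamped line, with the gap length.
find_truss_gaps() {
    awk -v limit="$1" '
        # Only lines whose first field looks like a "truss -d" stamp count;
        # wrapped continuation lines without a stamp are skipped.
        $1 ~ /^[0-9]+\.[0-9]+$/ {
            if (seen && $1 - prev > limit)
                printf "gap %.3fs before: %s\n", $1 - prev, $0
            prev = $1
            seen = 1
        }
    '
}
```

With the trace above saved as t.txt, running find_truss_gaps 5 < t.txt should report a single gap of about 15.043s, in front of the __sysctl line stamped 25.052947791. The one-second nanosleep intervals stay below the limit and are not reported.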

Mike Tancsa

2008-Jan-31 11:58 UTC

newfs + gstat locks entire machine for 20seconds

At 08:24 PM 1/30/2008, Steven Hartland wrote:
>The plot thickens.... This stall is not just related to newfs you have to
>have gstat running as well. If I do the newfs without gstat running then
>no stall occurs. As soon as Im running gstat while doing the newfs then
>everything locks as described.

Strange, I see the same thing sometimes.

While running this loop:

while true
do
    date
    sleep .5
done

I get:

Thu Jan 31 14:55:42 EST 2008
Thu Jan 31 14:55:42 EST 2008
Thu Jan 31 14:55:43 EST 2008
Thu Jan 31 14:55:43 EST 2008
Thu Jan 31 14:55:44 EST 2008
Thu Jan 31 14:55:44 EST 2008
Thu Jan 31 14:55:50 EST 2008
Thu Jan 31 14:55:50 EST 2008
Thu Jan 31 14:55:51 EST 2008
Thu Jan 31 14:55:51 EST 2008
Thu Jan 31 14:55:52 EST 2008
Thu Jan 31 14:55:52 EST 2008
Thu Jan 31 14:55:53 EST 2008
Thu Jan 31 14:55:53 EST 2008
Thu Jan 31 14:55:54 EST 2008

You can see the stall in the jump from 14:55:44 to 14:55:50.
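
Spotting the jump by eye works for a short run. For a longer run, the same loop can print epoch seconds and feed them to a filter that reports only the stalls. This is a sketch under assumptions rather than anything used in this thread: report_gaps is a hypothetical helper name, and it assumes date +%s (seconds since the epoch) is available, so the resolution is one second.

```shell
#!/bin/sh
# report_gaps LIMIT
# (hypothetical helper) Read one epoch-seconds value per line on stdin and
# report each pair of consecutive samples more than LIMIT seconds apart.
report_gaps() {
    limit=$1
    prev=
    while read -r now; do
        # Skip the comparison for the first sample, which has no predecessor.
        if [ -n "$prev" ] && [ $((now - prev)) -gt "$limit" ]; then
            echo "stall: $((now - prev)) s between samples $prev and $now"
        fi
        prev=$now
    done
}
```

It could be driven with: while true; do date +%s; sleep .5; done | report_gaps 2. With a half-second sleep, consecutive samples normally differ by 0 or 1, so a limit of 2 stays quiet until something holds the loop up.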

This is AMD64

da0 at arcmsr0 bus 0 target 0 lun 0
da0: <Areca ARC-1210-VOL#00 R001> Fixed Direct Access SCSI-5 device
da0: 166.666MB/s transfers (83.333MHz DT, offset 32, 16bit)
da0: 305175MB (624999424 512 byte sectors: 255H 63S/T 38904C)

arcmsr0: <Areca SATA Host Adapter RAID Controller
 > mem 0xe8600000-0xe8600fff,0xe8000000-0xe83fffff irq 18 at device 14.0 on pci2
ARECA RAID ADAPTER0: Driver Version 1.20.00.15 2007-10-07
ARECA RAID ADAPTER0: FIRMWARE VERSION V1.43 2007-4-17
arcmsr0: [ITHREAD]


>_______________________________________________
>freebsd-performance@freebsd.org mailing list
>http://lists.freebsd.org/mailman/listinfo/freebsd-performance
>To unsubscribe, send any mail to "freebsd-performance-unsubscribe@freebsd.org"
