Paul van den Bogaard
2007-Dec-07 16:42 UTC
[dtrace-discuss] /proc/<pid>/map access is sloooow when using large shared memory segments
In an earlier post (USDT probes from PostgreSQL) I mentioned that the pread
syscall takes approx 350 msecs on a /proc/<pid>/map file when the pid belongs
to a process that is attached to a large shared memory segment. The same goes
for fstat and lstat64.
When dtrace is started against multiple processes that are attached to large shm
segments, all their map files are read in sequential order. This adds up to rather
unfriendly "compile" times for dtrace (almost an hour) when attaching to my
PostgreSQL processes.
Initially I thought it was related to my PostgreSQL processes. I no longer think so.
BTW these tests are done on the "problem machine", a T2000, 8 cores
@ 1000MHz, 32GB of internal memory.
A simple
timex ls /proc/<pid>/map
already shows this amount of time being spent (where <pid> is the pid of a process
executing a self-made program that creates a 16GB shared memory segment).
Using dtrace on this ls process shows lstat64 consuming the time I am looking
for.
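For a quick confirmation of where the wall-clock time goes before diving deeper, a small script along these lines could be used (a rough sketch only; the script name and the 00000 pid are placeholders):

dtrace -qs lstat_time.d -c "ls /proc/00000/map"

with lstat_time.d being:

syscall::lstat64:entry
/pid == $target/
{
        self->file = copyinstr(arg0);
        self->ts = timestamp;
}

syscall::lstat64:return
/self->ts/
{
        /* accumulate time spent in lstat64 (nanoseconds), per file name */
        @[self->file] = sum(timestamp - self->ts);
        self->file = 0;
        self->ts = 0;
}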
Ok, I created the following script to dive deeper (I can only hope it does what I
think it does :-)
The script is started like:
dtrace -qs script.d -c "ls /proc/00000/map"
BEGIN
{
        printf("Target = %d\n", $target);
        self->depth = 0;
        started = timestamp;
}

syscall::lstat64:entry
/pid == $target/
{
        self->filename = basename(copyinstr(arg0));
}

syscall::lstat64:entry
/self->filename == "map"/
{
        self->flagIt = 1;       /* OK the show begins */
        self->depth = 0;
}

syscall::lstat64:return
/self->flagIt/
{
        self->flagIt = 0;       /* And the show stops */
}

/*
 * Already found some things in a previous version. Want to know the caller
 * of a function that takes time. This is what the self->parent is used for.
 */
fbt:::entry
/self->flagIt/
{
        self->depth++;
        self->ts[self->depth] = timestamp;
        self->parent[self->depth] = probefunc;
}

fbt:::return
/self->depth/
{
        @[self->parent[self->depth - 1], self->depth, probefunc] =
            sum(timestamp - self->ts[self->depth]);
        self->parent[self->depth] = 0;
        self->depth--;
}

END
{
        printf("WALL: %d\n", (timestamp - started) / 1000);
        normalize(@, 1000);
        printa("%20s %3d %20s %@20d\n", @);
}
This little bit of magic gives me the following output (I stripped many lines for
calls that do not consume time). The columns are caller, fbt depth, function, and
accumulated time in microseconds:
WALL: 113.6 seconds
          lookuppnat    5            lookuppnvp                5,540
        lookupnameat    4            lookuppnat                5,589
       cstatat_getvp    3          lookupnameat                5,686
        cstatat64_32    2         cstatat_getvp                5,782
       pr_pagev_fill    8     segspt_shmgetprot                5,961
          pr_getprot    7         pr_pagev_fill                6,081
       pr_pagev_fill    9     segspt_shmgetprot               39,266
   pr_pagev_nextprot    8         pr_pagev_fill               39,646
        getwatchprot    9              avl_find           29,973,247
   pr_pagev_nextprot    8          getwatchprot           87,968,115
          pr_getprot    7     pr_pagev_nextprot          113,528,006
             prnsegs    6            pr_getprot          113,537,387
           prgetattr    5               prnsegs          113,539,929
         fop_getattr    4             prgetattr          113,540,258
          cstat64_32    3           fop_getattr          113,540,307
        cstatat64_32    2            cstat64_32          113,540,480
                        1          cstatat64_32          113,547,874
avl_find does not appear to call any children, so it consumes quite some time on
its own (30 seconds).
getwatchprot used 88 seconds, of which 30 are spent in this avl_find.
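To double-check these two numbers without tracing every fbt probe, a more focused script could time just those two functions. This is only a sketch, relying on the fact (visible in the output above) that fbt can probe the static getwatchprot(); the script name and pid are placeholders again:

dtrace -qs getwatchprot_time.d -c "ls /proc/00000/map"

fbt::getwatchprot:entry
/pid == $target/
{
        self->gw = timestamp;
}

fbt::avl_find:entry
/self->gw/
{
        /* only avl_find calls made while inside getwatchprot */
        self->av = timestamp;
}

fbt::avl_find:return
/self->av/
{
        @["avl_find from getwatchprot (ns)"] = sum(timestamp - self->av);
        self->av = 0;
}

fbt::getwatchprot:return
/self->gw/
{
        @["getwatchprot incl. children (ns)"] = sum(timestamp - self->gw);
        @calls["getwatchprot calls"] = count();
        self->gw = 0;
}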
The source (http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/proc/prsubr.c,
around line 3450) shows:

/*
 * Return the original protections for the specified page.
 */
static void
getwatchprot(struct as *as, caddr_t addr, uint_t *prot)
{
        struct watched_page *pwp;
        struct watched_page tpw;

        ASSERT(AS_LOCK_HELD(as, &as->a_lock));

        tpw.wp_vaddr = (caddr_t)((uintptr_t)addr & (uintptr_t)PAGEMASK);
        if ((pwp = avl_find(&as->a_wpage, &tpw, NULL)) != NULL)
                *prot = pwp->wp_oprot;
}
My idea therefore is that AS_LOCK_HELD is responsible for the remaining approx
58 seconds spent waiting.
Unfortunately the cross-reference link for AS_LOCK_HELD points to nothing valid,
and neither does the one for avl_find.
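One way to test the lock hypothesis would be to look at the per-call latency distribution of getwatchprot: a handful of very slow calls would point at waiting behind the lock, while millions of uniformly cheap calls would point at sheer call volume. A minimal sketch, again with placeholder script name and pid:

dtrace -qs getwatchprot_dist.d -c "ls /proc/00000/map"

fbt::getwatchprot:entry
/pid == $target/
{
        self->gw = timestamp;
}

fbt::getwatchprot:return
/self->gw/
{
        /* per-call latency in nanoseconds, as a power-of-two histogram */
        @["getwatchprot latency (ns)"] = quantize(timestamp - self->gw);
        self->gw = 0;
}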
Reducing my shared memory segment to 1GB (an arbitrary size, well below 16GB),
the results are:
        getwatchprot    9              avl_find            1,950,850
   pr_pagev_nextprot    8          getwatchprot            5,737,536
          pr_getprot    7     pr_pagev_nextprot            7,412,876
             prnsegs    6            pr_getprot            7,419,244
           prgetattr    5               prnsegs            7,421,813
         fop_getattr    4             prgetattr            7,422,144
          cstat64_32    3           fop_getattr            7,422,193
        cstatat64_32    2            cstat64_32            7,422,366
                        1          cstatat64_32            7,430,020
Looks like there is a linear relationship between segment size and time spent
in getwatchprot (113.5 seconds versus 7.4 seconds is roughly a factor of 15,
close to the 16:1 ratio in segment size).
Any clues why this takes so long?
BTW I did the same test on a V890, 16 x 1350MHz. It did not take that long, but
still quite some time (approx 2 times faster than the T2000).
A pmap of this process shows it uses many 256MB pages and of course a couple of
4M, 64K and 8K ones. The shared segment is shmat-ted with the SHM_SHARE_MMU flag set.
Now, recompiling this program without the SHM_SHARE_MMU flag works miracles.
The results now are:
          pr_getprot    7     pr_pagev_nextprot                6,414
             prnsegs    6            pr_getprot                8,759
           prgetattr    5               prnsegs               11,447
         fop_getattr    4             prgetattr               11,776
          cstat64_32    3           fop_getattr               11,823
        cstatat64_32    2            cstat64_32               11,996
                        1          cstatat64_32               19,479
pmap-ping this process shows the largest page size is now 4M. Could the slowness
be due to the 256MB pages then?
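It might also help to see which segment driver ends up doing the per-page work in the two builds. segspt_shmgetprot shows up in the output above for the SHM_SHARE_MMU (ISM) case; segvn_getprot is assumed here to be the corresponding entry point for the non-ISM mapping. A rough counting sketch:

dtrace -qs getprot_count.d -c "ls /proc/00000/map"

fbt::segspt_shmgetprot:entry,
fbt::segvn_getprot:entry,
fbt::getwatchprot:entry
/pid == $target/
{
        /* count how often each getprot-related function is entered */
        @[probefunc] = count();
}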
--
This message posted from opensolaris.org
Mike Shapiro
2007-Dec-07 19:35 UTC
[dtrace-discuss] /proc/<pid>/map access is sloooow when using large shared memory segments
> Looks like there is a linear relationship between segment size and
> time spent in getwatchprot.
>
> Any clues why this takes so long?

Because there is a linear relationship between segment size and the number of
iterations of pr_getprot() :)

Basically this is all about disgusting details of the current VM system. The
address space is divided into segment ranges internally. Within those segments
there are a variety of optimizations for dealing with differing protections and
watchpoints. /proc has a nice model of expressing the address space in its map,
rmap, and xmap files, where a "mapping" is defined as some range of the address
space that has a consistent set of attributes, in particular a particular set of
permission bits expressed in the prmap_t structure as a single bitmap.

So when you read the map file, we have to walk the segments and poke at their
underlying attributes to deal with the notion that some segments have to be
sliced and diced into multiple prmap_t's. And by definition this is a linear
process given the current design.

If you want to make it go faster, what you need to do is figure out what aspect
of your segment attributes is causing pr_getprot() to be called multiple times
per segment, i.e. why we're not getting into one of the many fast-paths in there.
Depending on what you find we may be able to add yet another one.

-Mike

--
Mike Shapiro, Solaris Kernel Development. blogs.sun.com/mws/
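Following Mike's suggestion, a first step could be to count how many times pr_getprot() runs for each segment during a single stat of the map file, and compare that with the number of mappings pmap reports. The sketch below assumes arg0 of pr_getprot() is the struct seg pointer being sliced (per the prsubr.c source); the script name and pid are placeholders:

dtrace -qs pr_getprot_count.d -c "ls /proc/00000/map"

fbt::pr_getprot:entry
/pid == $target/
{
        /* assumption: arg0 is the struct seg * being sliced into prmap_t ranges */
        @per_seg[arg0] = count();
        @total["total pr_getprot calls"] = count();
}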