Paul van den Bogaard
2007-Dec-07 16:42 UTC
[dtrace-discuss] /proc/<pid>/map access is sloooow when using large shared memory segments
In an earlier post (USDT probes from PostgreSQL) I mentioned that the pread syscall took approx 350 msec on a /proc/<pid>/map file, where the pid is that of a process attached to a large shared memory segment. The same goes for fstat and lstat64. Since starting dtrace against multiple processes that are attached to large shm segments means all their map files are read sequentially, this grows into rather unfriendly "compile" times for dtrace (almost an hour) when attaching to my PostgreSQL processes. Initially I thought it was related to my PostgreSQL processes; I do not think so anymore. BTW these tests are done on the "problem machine", an 8-core T2000 @ 1000 MHz with 32 GB of memory.

A timex ls /proc/<pid>/map already shows this amount of time being spent (where pid is the pid of a process running a self-made program that creates a 16 GB shared memory segment). Using dtrace on this ls process shows lstat64 consuming the time I am looking for.

OK, I created the following script to dive deeper (I can only hope it does what I think it does :-). The script is started like:

  dtrace -qs script.d -c "ls /proc/00000/map"

BEGIN
{
	printf("Target = %d\n", $target);
	self->depth = 0;
	started = timestamp;
}

syscall::lstat64:entry
/pid == $target/
{
	self->filename = basename(copyinstr(arg0));
}

syscall::lstat64:entry
/self->filename == "map"/
{
	self->flagIt = 1;	/* OK the show begins */
	self->depth = 0;
}

syscall::lstat64:return
/self->flagIt/
{
	self->flagIt = 0;	/* And the show stops */
}

/*
 * Already found some things in a previous version. Want to know the
 * caller of a function that takes time. This is what self->parent is
 * used for.
 */
fbt:::entry
/self->flagIt/
{
	self->depth++;
	self->ts[self->depth] = timestamp;
	self->parent[self->depth] = probefunc;
}

fbt:::return
/self->depth/
{
	@[self->parent[self->depth - 1], self->depth, probefunc] =
	    sum(timestamp - self->ts[self->depth]);
	self->parent[self->depth] = 0;
	self->depth--;
}

END
{
	printf("WALL: %d\n", (timestamp - started) / 1000);
	normalize(@, 1000);
	printa("%20s %3d %20s %@20d\n", @);
}

This little magic gives me the following output (with the many lines for calls that do not consume time stripped). The columns are caller, depth, function, and summed (inclusive) time in microseconds.

WALL: 113.6 seconds

   lookuppnat          5   lookuppnvp                  5,540
   lookupnameat        4   lookuppnat                  5,589
   cstatat_getvp       3   lookupnameat                5,686
   cstatat64_32        2   cstatat_getvp               5,782
   pr_pagev_fill       8   segspt_shmgetprot           5,961
   pr_getprot          7   pr_pagev_fill               6,081
   pr_pagev_fill       9   segspt_shmgetprot          39,266
   pr_pagev_nextprot   8   pr_pagev_fill              39,646
   getwatchprot        9   avl_find               29,973,247
   pr_pagev_nextprot   8   getwatchprot           87,968,115
   pr_getprot          7   pr_pagev_nextprot     113,528,006
   prnsegs             6   pr_getprot            113,537,387
   prgetattr           5   prnsegs               113,539,929
   fop_getattr         4   prgetattr             113,540,258
   cstat64_32          3   fop_getattr           113,540,307
   cstatat64_32        2   cstat64_32            113,540,480
                       1   cstatat64_32          113,547,874

avl_find does not appear to call any children, and by itself consumes quite some time (30 seconds). getwatchprot used 88 seconds, of which 30 are spent in this avl_find, leaving roughly 58 seconds in getwatchprot itself. The source shows (http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/proc/prsubr.c):

/*
 * Return the original protections for the specified page.
 */
static void
getwatchprot(struct as *as, caddr_t addr, uint_t *prot)
{
	struct watched_page *pwp;
	struct watched_page tpw;

	ASSERT(AS_LOCK_HELD(as, &as->a_lock));

	tpw.wp_vaddr = (caddr_t)((uintptr_t)addr & (uintptr_t)PAGEMASK);
	if ((pwp = avl_find(&as->a_wpage, &tpw, NULL)) != NULL)
		*prot = pwp->wp_oprot;
}

My idea therefore is that AS_LOCK_HELD is responsible for those approx 58 seconds of time spent waiting. Unfortunately the cross-reference link for AS_LOCK_HELD points to nothing valid, and neither does the one for avl_find.

Now, reducing my shared memory segment to 1 GB (an arbitrary size, way smaller than 16 GB), the results are:

   getwatchprot        9   avl_find                1,950,850
   pr_pagev_nextprot   8   getwatchprot            5,737,536
   pr_getprot          7   pr_pagev_nextprot       7,412,876
   prnsegs             6   pr_getprot              7,419,244
   prgetattr           5   prnsegs                 7,421,813
   fop_getattr         4   prgetattr               7,422,144
   cstat64_32          3   fop_getattr             7,422,193
   cstatat64_32        2   cstat64_32              7,422,366
                       1   cstatat64_32            7,430,020

Looks like there is a linear relationship between segment size and time spent in getwatchprot.

Any clues as to why this takes so long?

BTW I did the same test on a V890 (16 x 1350 MHz). It did not take that long, but still quite some time (approx 2 times faster than the T2000).

A pmap of this process shows it uses many 256 MB pages and of course a couple of 4M, 64K and 8K ones. The shared segment is shmat-ed with the SHM_SHARE_MMU flag set. Now, a recompile of this program without the SHM_SHARE_MMU flag does miracles. The results now are:

   pr_getprot          7   pr_pagev_nextprot           6,414
   prnsegs             6   pr_getprot                  8,759
   prgetattr           5   prnsegs                    11,447
   fop_getattr         4   prgetattr                  11,776
   cstat64_32          3   fop_getattr                11,823
   cstatat64_32        2   cstat64_32                 11,996
                       1   cstatat64_32               19,479

pmap-ping this process shows the largest page size is now 4M. Could this be due to the 256 MB pages then?

--
This message posted from opensolaris.org
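[For reference, a minimal C sketch of the kind of test program described above: it creates a large System V shared memory segment and keeps the process alive so that timex ls /proc/<pid>/map can be run against it. This is not the original source; the 16 GB size, the USE_ISM compile-time toggle and the pause() keep-alive are assumptions. Building with -DUSE_ISM attaches with SHM_SHARE_MMU (the ISM case that pmap showed backed by 256 MB pages); building without it reproduces the fast case.]

/*
 * Sketch only (not the original test program): create a large SysV
 * shared memory segment and stay alive so /proc/<pid>/map can be probed.
 * Compile 64-bit (e.g. cc -m64) so a 16 GB size fits in size_t; add
 * -DUSE_ISM to attach with SHM_SHARE_MMU as in the slow case above.
 */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	size_t size = 16ULL << 30;	/* 16 GB; adjust to taste */
	int flags = 0;
	int shmid;
	void *addr;

#ifdef USE_ISM
	flags = SHM_SHARE_MMU;		/* request ISM mappings */
#endif
	if ((shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600)) == -1) {
		perror("shmget");
		return (1);
	}
	if ((addr = shmat(shmid, NULL, flags)) == (void *)-1) {
		perror("shmat");
		return (1);
	}
	(void) printf("pid %ld attached at %p\n", (long)getpid(), addr);
	(void) pause();			/* now run: timex ls /proc/<pid>/map */
	return (0);
}

[On Solaris 10 and later the project.max-shm-memory resource control may need raising before a segment of this size can be created.]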
Mike Shapiro
2007-Dec-07 19:35 UTC
[dtrace-discuss] /proc/<pid>/map access is sloooow when using large shared memory segments
> Looks like there is a linear relationship between segment size and
> time spent in getwatchprot.
>
> Any clues as to why this takes so long?

Because there is a linear relationship between segment size and the number of iterations of pr_getprot() :)

Basically this is all about the disgusting details of the current VM system. The address space is divided into segment ranges internally. Within those segments there are a variety of optimizations for dealing with differing protections and watchpoints. /proc has a nice model for expressing the address space in its map, rmap, and xmap files, where a "mapping" is defined as some range of the address space that has a consistent set of attributes, in particular a set of permission bits expressed in the prmap_t structure as a single bitmap.

So when you read the map file, we have to walk the segments and poke at their underlying attributes to deal with the notion that some segments have to be sliced and diced into multiple prmap_t's. And by definition this is a linear process given the current design.

If you want to make it go faster, what you need to do is figure out what aspect of your segment attributes is causing pr_getprot() to be called multiple times per segment, i.e. why we're not getting into one of the many fast paths in there. Depending on what you find we may be able to add yet another one.

-Mike

--
Mike Shapiro, Solaris Kernel Development.	blogs.sun.com/mws/
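[To see the quantity Mike points at -- how many prmap_t records the kernel produces, and hence how many slices each segment is cut into -- the map file can simply be read and decoded, since per proc(4) it is an array of prmap_t structures. The sketch below is illustrative only (hypothetical program, minimal error handling); note that the fstat() here presumably goes through the same prgetattr()/prnsegs() path seen in the traces above, so it will be just as slow for the 16 GB ISM case.]

/*
 * Sketch: dump /proc/<pid>/map as prmap_t records (see proc(4)).
 * Counting the records shows how many mappings the kernel carves the
 * address space into -- the per-segment slicing described above.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <procfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	char path[64];
	struct stat st;
	prmap_t *maps;
	ssize_t nread;
	int fd, i, n;

	if (argc != 2) {
		(void) fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return (1);
	}
	(void) snprintf(path, sizeof (path), "/proc/%s/map", argv[1]);
	if ((fd = open(path, O_RDONLY)) == -1) {
		perror(path);
		return (1);
	}
	if (fstat(fd, &st) == -1) {	/* sized via prgetattr()/prnsegs() */
		perror("fstat");
		return (1);
	}
	if ((maps = malloc(st.st_size)) == NULL) {
		perror("malloc");
		return (1);
	}
	if ((nread = read(fd, maps, st.st_size)) < 0) {
		perror("read");
		return (1);
	}
	n = (int)(nread / (ssize_t)sizeof (prmap_t));
	for (i = 0; i < n; i++) {
		(void) printf("%4d: vaddr=0x%lx size=%lu mflags=0x%x "
		    "pagesize=%d shmid=%d\n", i,
		    (ulong_t)maps[i].pr_vaddr, (ulong_t)maps[i].pr_size,
		    maps[i].pr_mflags, maps[i].pr_pagesize,
		    maps[i].pr_shmid);
	}
	(void) printf("%d mappings\n", n);
	free(maps);
	(void) close(fd);
	return (0);
}

[Comparing the mapping count and pr_pagesize values between the SHM_SHARE_MMU and non-SHM_SHARE_MMU builds of the test program should help show which segment attribute is defeating the pr_getprot() fast paths.]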