Paul van den Bogaard
2007-Dec-07 16:42 UTC
[dtrace-discuss] /proc/<pid>/map access is sloooow when using large shared memory segments
In an earlier post (USDT probes from PostgreSQL) I mentioned that the pread syscall took approx 350 msec on a /proc/<pid>/map file, where the pid is that of a process attached to a large shared memory segment. The same goes for fstat and lstat64. Since starting dtrace against multiple processes that are attached to large shm segments means all their map files are read sequentially, this grows into rather unfriendly "compile" times for dtrace (almost an hour) when attaching to my PostgreSQL processes. Initially I thought it was related to my PostgreSQL processes; I do not think so anymore. BTW these tests are done on the "problem machine", an 8-core T2000 @ 1000 MHz with 32 GB of memory.

A timex ls /proc/<pid>/map already shows this amount of time being spent (where pid is the pid of a process running a self-made program that creates a 16 GB shared memory segment). Using dtrace on this ls process shows lstat64 consuming the time I am looking for.

OK, I created the following script to dive deeper (I can only hope it does what I think it does :-). The script is started like:

  dtrace -qs script.d -c "ls /proc/00000/map"

BEGIN
{
	printf("Target = %d\n", $target);
	self->depth = 0;
	started = timestamp;
}

syscall::lstat64:entry
/pid == $target/
{
	self->filename = basename(copyinstr(arg0));
}

syscall::lstat64:entry
/self->filename == "map"/
{
	self->flagIt = 1;	/* OK the show begins */
	self->depth = 0;
}

syscall::lstat64:return
/self->flagIt/
{
	self->flagIt = 0;	/* And the show stops */
}

/*
 * Already found some things in a previous version. Want to know the
 * caller of a function that takes time. This is what self->parent is
 * used for.
 */
fbt:::entry
/self->flagIt/
{
	self->depth++;
	self->ts[self->depth] = timestamp;
	self->parent[self->depth] = probefunc;
}

fbt:::return
/self->depth/
{
	@[self->parent[self->depth - 1], self->depth, probefunc] =
	    sum(timestamp - self->ts[self->depth]);
	self->parent[self->depth] = 0;
	self->depth--;
}

END
{
	printf("WALL: %d\n", (timestamp - started) / 1000);
	normalize(@, 1000);
	printa("%20s %3d %20s %@20d\n", @);
}

This little magic gives me the following output (with the many lines for calls that do not consume time stripped). The columns are caller, depth, function, and summed (inclusive) time in microseconds.

WALL: 113.6 seconds

   lookuppnat          5   lookuppnvp                  5,540
   lookupnameat        4   lookuppnat                  5,589
   cstatat_getvp       3   lookupnameat                5,686
   cstatat64_32        2   cstatat_getvp               5,782
   pr_pagev_fill       8   segspt_shmgetprot           5,961
   pr_getprot          7   pr_pagev_fill               6,081
   pr_pagev_fill       9   segspt_shmgetprot          39,266
   pr_pagev_nextprot   8   pr_pagev_fill              39,646
   getwatchprot        9   avl_find               29,973,247
   pr_pagev_nextprot   8   getwatchprot           87,968,115
   pr_getprot          7   pr_pagev_nextprot     113,528,006
   prnsegs             6   pr_getprot            113,537,387
   prgetattr           5   prnsegs               113,539,929
   fop_getattr         4   prgetattr             113,540,258
   cstat64_32          3   fop_getattr           113,540,307
   cstatat64_32        2   cstat64_32            113,540,480
                       1   cstatat64_32          113,547,874

avl_find does not appear to call any children, and by itself consumes quite some time (30 seconds). getwatchprot used 88 seconds, of which 30 are spent in this avl_find, leaving roughly 58 seconds in getwatchprot itself. The source shows (http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/proc/prsubr.c):

/*
 * Return the original protections for the specified page.
 */
static void
getwatchprot(struct as *as, caddr_t addr, uint_t *prot)
{
	struct watched_page *pwp;
	struct watched_page tpw;

	ASSERT(AS_LOCK_HELD(as, &as->a_lock));

	tpw.wp_vaddr = (caddr_t)((uintptr_t)addr & (uintptr_t)PAGEMASK);
	if ((pwp = avl_find(&as->a_wpage, &tpw, NULL)) != NULL)
		*prot = pwp->wp_oprot;
}

My idea therefore is that AS_LOCK_HELD is responsible for those approx 58 seconds of time spent waiting. Unfortunately the cross-reference link for AS_LOCK_HELD points to nothing valid, and neither does the one for avl_find.

Now, reducing my shared memory segment to 1 GB (an arbitrary size, way smaller than 16 GB), the results are:

   getwatchprot        9   avl_find                1,950,850
   pr_pagev_nextprot   8   getwatchprot            5,737,536
   pr_getprot          7   pr_pagev_nextprot       7,412,876
   prnsegs             6   pr_getprot              7,419,244
   prgetattr           5   prnsegs                 7,421,813
   fop_getattr         4   prgetattr               7,422,144
   cstat64_32          3   fop_getattr             7,422,193
   cstatat64_32        2   cstat64_32              7,422,366
                       1   cstatat64_32            7,430,020

Looks like there is a linear relationship between segment size and time spent in getwatchprot.

Any clues as to why this takes so long?

BTW I did the same test on a V890 (16 x 1350 MHz). It did not take that long, but still quite some time (approx 2 times faster than the T2000).

A pmap of this process shows it uses many 256 MB pages and of course a couple of 4M, 64K and 8K ones. The shared segment is shmat-ed with the SHM_SHARE_MMU flag set. Now, a recompile of this program without the SHM_SHARE_MMU flag does miracles. The results now are:

   pr_getprot          7   pr_pagev_nextprot           6,414
   prnsegs             6   pr_getprot                  8,759
   prgetattr           5   prnsegs                    11,447
   fop_getattr         4   prgetattr                  11,776
   cstat64_32          3   fop_getattr                11,823
   cstatat64_32        2   cstat64_32                 11,996
                       1   cstatat64_32               19,479

pmap-ping this process shows the largest page size is now 4M. Could this be due to the 256 MB pages then?

--
This message posted from opensolaris.org
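[For reference, a minimal C sketch of the kind of test program described above: it creates a large System V shared memory segment and keeps the process alive so that timex ls /proc/<pid>/map can be run against it. This is not the original source; the 16 GB size, the USE_ISM compile-time toggle and the pause() keep-alive are assumptions. Building with -DUSE_ISM attaches with SHM_SHARE_MMU (the ISM case that pmap showed backed by 256 MB pages); building without it reproduces the fast case.]

/*
 * Sketch only (not the original test program): create a large SysV
 * shared memory segment and stay alive so /proc/<pid>/map can be probed.
 * Compile 64-bit (e.g. cc -m64) so a 16 GB size fits in size_t; add
 * -DUSE_ISM to attach with SHM_SHARE_MMU as in the slow case above.
 */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	size_t size = 16ULL << 30;	/* 16 GB; adjust to taste */
	int flags = 0;
	int shmid;
	void *addr;

#ifdef USE_ISM
	flags = SHM_SHARE_MMU;		/* request ISM mappings */
#endif
	if ((shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600)) == -1) {
		perror("shmget");
		return (1);
	}
	if ((addr = shmat(shmid, NULL, flags)) == (void *)-1) {
		perror("shmat");
		return (1);
	}
	(void) printf("pid %ld attached at %p\n", (long)getpid(), addr);
	(void) pause();			/* now run: timex ls /proc/<pid>/map */
	return (0);
}

[On Solaris 10 and later the project.max-shm-memory resource control may need raising before a segment of this size can be created.]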
Mike Shapiro
2007-Dec-07 19:35 UTC
[dtrace-discuss] /proc/<pid>/map access is sloooow when using large shared memory segments
> Looks like there is a linear relationship between segment size and
> time spent in getwatchprot.
>
> Any clues as to why this takes so long?

Because there is a linear relationship between segment size and the number of iterations of pr_getprot() :)

Basically this is all about the disgusting details of the current VM system. The address space is divided into segment ranges internally. Within those segments there are a variety of optimizations for dealing with differing protections and watchpoints. /proc has a nice model for expressing the address space in its map, rmap, and xmap files, where a "mapping" is defined as some range of the address space that has a consistent set of attributes, in particular a set of permission bits expressed in the prmap_t structure as a single bitmap.

So when you read the map file, we have to walk the segments and poke at their underlying attributes to deal with the notion that some segments have to be sliced and diced into multiple prmap_t's. And by definition this is a linear process given the current design.

If you want to make it go faster, what you need to do is figure out what aspect of your segment attributes is causing pr_getprot() to be called multiple times per segment, i.e. why we're not getting into one of the many fast paths in there. Depending on what you find we may be able to add yet another one.

-Mike

--
Mike Shapiro, Solaris Kernel Development.	blogs.sun.com/mws/
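[To see the quantity Mike points at -- how many prmap_t records the kernel produces, and hence how many slices each segment is cut into -- the map file can simply be read and decoded, since per proc(4) it is an array of prmap_t structures. The sketch below is illustrative only (hypothetical program, minimal error handling); note that the fstat() here presumably goes through the same prgetattr()/prnsegs() path seen in the traces above, so it will be just as slow for the 16 GB ISM case.]

/*
 * Sketch: dump /proc/<pid>/map as prmap_t records (see proc(4)).
 * Counting the records shows how many mappings the kernel carves the
 * address space into -- the per-segment slicing described above.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <procfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	char path[64];
	struct stat st;
	prmap_t *maps;
	ssize_t nread;
	int fd, i, n;

	if (argc != 2) {
		(void) fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return (1);
	}
	(void) snprintf(path, sizeof (path), "/proc/%s/map", argv[1]);
	if ((fd = open(path, O_RDONLY)) == -1) {
		perror(path);
		return (1);
	}
	if (fstat(fd, &st) == -1) {	/* sized via prgetattr()/prnsegs() */
		perror("fstat");
		return (1);
	}
	if ((maps = malloc(st.st_size)) == NULL) {
		perror("malloc");
		return (1);
	}
	if ((nread = read(fd, maps, st.st_size)) < 0) {
		perror("read");
		return (1);
	}
	n = (int)(nread / (ssize_t)sizeof (prmap_t));
	for (i = 0; i < n; i++) {
		(void) printf("%4d: vaddr=0x%lx size=%lu mflags=0x%x "
		    "pagesize=%d shmid=%d\n", i,
		    (ulong_t)maps[i].pr_vaddr, (ulong_t)maps[i].pr_size,
		    maps[i].pr_mflags, maps[i].pr_pagesize,
		    maps[i].pr_shmid);
	}
	(void) printf("%d mappings\n", n);
	free(maps);
	(void) close(fd);
	return (0);
}

[Comparing the mapping count and pr_pagesize values between the SHM_SHARE_MMU and non-SHM_SHARE_MMU builds of the test program should help show which segment attribute is defeating the pr_getprot() fast paths.]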