I have been trying to determine why scp runs slower on T1000 and T2000
systems as compared to other SPARC systems. I have been using the
profile provider to help me determine what instructions run "hot" on
the T1000 I have available for testing.
Here is the script I am using:
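#!/usr/sbin/dtrace -s

/* Number of user-level samples recorded so far. */
int cnter;

/* arg1 is the user PC; it is zero when the sample lands in the kernel. */
profile-997
/arg1 && pid == $target/
{
	@aes[ustack(1)] = count();
	cnter++;
}

/* Stop once 40000 samples have been recorded. */
profile-997
/cnter >= 40000/
{
	exit(0);
}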
I then take this data to find the hot functions and the hot
instructions within those functions.
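As a sketch of that post-processing step, DTrace can do the
per-function and per-instruction grouping itself. This variant assumes
the ufunc() and uaddr() actions are available in the installed DTrace
release; the aggregation names @funcs and @insns are illustrative and
not part of the script above.

```dtrace
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* arg1 is the user PC; skip samples that landed in the kernel. */
profile-997
/arg1 && pid == $target/
{
	@funcs[ufunc(arg1)] = count();	/* hits per function */
	@insns[uaddr(arg1)] = count();	/* hits per instruction address */
}
```

Like the script above, it relies on $target, so it has to be started
with -c or -p. Both aggregations are printed when dtrace exits.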
Looking at the data, the hottest function is AES_encrypt (no surprise
there). It has 50% more hits on the T1000 than a V220 I am testing on.
This is interesting because, measured with timex, scp takes about 50%
longer overall on the T1000 than on the V220, even though the T1000
has a 1GHz clock and the V220 only 450MHz.
Even more interesting, the AES_encrypt function shows fewer hotspots on
the T1000 than the V220. Here is a bit of output from the listing of
AES_encrypt from both systems. The numbers are the number of hits from
5 runs of the profile script above. Blanks are "no hits".
V220:
311 AES_encrypt+0x110: e0 02 c0 01 ld [%o3 + %g1], %l0
  2 AES_encrypt+0x114: 97 34 60 0e srl %l1, 0xe, %o3
    AES_encrypt+0x118: 84 04 fc 00 add %l3, -0x400, %g2
342 AES_encrypt+0x11c: c6 02 80 16 ld [%o2 + %l6], %g3
  1 AES_encrypt+0x120: 90 0a 63 fc and %o1, 0x3fc, %o0
  4 AES_encrypt+0x124: 88 0d 20 ff and %l4, 0xff, %g4
338 AES_encrypt+0x128: b1 29 20 02 sll %g4, 0x2, %i0
    AES_encrypt+0x12c: f8 02 00 02 ld [%o0 + %g2], %i4
    AES_encrypt+0x130: 9e 0b 7f fc and %o5, -0x4, %o7
314 AES_encrypt+0x134: 93 35 20 06 srl %l4, 0x6, %o1
    AES_encrypt+0x138: fa 06 00 13 ld [%i0 + %l3], %i5
    AES_encrypt+0x13c: b6 1c 00 03 xor %l0, %g3, %i3
317 AES_encrypt+0x140: c6 03 c0 01 ld [%o7 + %g1], %g3
 11 AES_encrypt+0x144: 94 0a e3 fc and %o3, 0x3fc, %o2
    AES_encrypt+0x148: 98 1e c0 1c xor %i3, %i4, %o4
T1000:
123 AES_encrypt+0x110: e0 02 c0 01 ld [%o3 + %g1], %l0
257 AES_encrypt+0x114: 97 34 60 0e srl %l1, 0xe, %o3
109 AES_encrypt+0x118: 84 04 fc 00 add %l3, -0x400, %g2
123 AES_encrypt+0x11c: c6 02 80 16 ld [%o2 + %l6], %g3
285 AES_encrypt+0x120: 90 0a 63 fc and %o1, 0x3fc, %o0
137 AES_encrypt+0x124: 88 0d 20 ff and %l4, 0xff, %g4
116 AES_encrypt+0x128: b1 29 20 02 sll %g4, 0x2, %i0
122 AES_encrypt+0x12c: f8 02 00 02 ld [%o0 + %g2], %i4
257 AES_encrypt+0x130: 9e 0b 7f fc and %o5, -0x4, %o7
123 AES_encrypt+0x134: 93 35 20 06 srl %l4, 0x6, %o1
109 AES_encrypt+0x138: fa 06 00 13 ld [%i0 + %l3], %i5
309 AES_encrypt+0x13c: b6 1c 00 03 xor %l0, %g3, %i3
136 AES_encrypt+0x140: c6 03 c0 01 ld [%o7 + %g1], %g3
196 AES_encrypt+0x144: 94 0a e3 fc and %o3, 0x3fc, %o2
116 AES_encrypt+0x148: 98 1e c0 1c xor %i3, %i4, %o4
Notice that on the V220 particular instructions take a while, and then
the next couple have no hits. On the T1000, every instruction is hit,
with counts ranging from about 100 to 300. I suspect this has something
to do with CMT: where a cache miss on the V220 stalls the instruction,
on the T1000 it causes a thread switch, and when the profile probe
fires, the script above won't record it. Does the profile provider
fire for the virtual CPUs?
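One way to check the virtual CPU question empirically is to key the
samples on the built-in cpu variable. This is a minimal sketch,
assuming each hardware strand is presented to Solaris as its own CPU
ID; the aggregation name @samples and the "target"/"other" labels are
illustrative.

```dtrace
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Count every sample per CPU ID, split by whether the target was running. */
profile-997
{
	@samples[cpu, pid == $target ? "target" : "other"] = count();
}
```

If every CPU ID shows up in the output, the probe is firing on all of
them. The rows labeled "target" show on which CPUs the traced process
was actually running when it was sampled.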
Anyway, I am at a loss. Is there any way to cause the profile provider
to record the PC of a process even if that process is not on the CPU?
Perhaps the tick provider would be better?
Any ideas how to figure this out?
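A related measurement is how long the target spends off CPU at the
scheduler level. The sketch below uses the sched provider's off-cpu
and on-cpu probes. Whether it is relevant here is an assumption: it
only sees operating system context switches and would not show a
hardware strand switching on a cache miss. The names self->off and
@offcpu are illustrative.

```dtrace
#!/usr/sbin/dtrace -s

/* Note when a thread of the target process leaves a CPU. */
sched:::off-cpu
/pid == $target/
{
	self->off = timestamp;
}

/* When that thread is scheduled again, add up the time it was away. */
sched:::on-cpu
/self->off/
{
	@offcpu["ns off CPU"] = sum(timestamp - self->off);
	self->off = 0;
}
```

Comparing this total against the elapsed time reported by timex gives
a rough idea of how much of the run the process was not on a CPU at
all, as far as the scheduler is concerned.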
--
blu
There are two rules in life:
Rule 1- Don't tell people everything you know
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom