./Configure hpux-parisc2-cc
will pull in asm/pa-risc2.o.

I'll copy Chris (author of that code) in case he has
any thoughts.
On Fri, Jul 12, 2002 at 03:54:29AM -0400, Deron Meranda wrote:
> I think I finally figured out the problem that many people have been
> having with extremely long login times under HP-UX 11.x. The problem
> is really in OpenSSL, and in particular the Diffie-Hellman parameter
> generation routines under the PA-RISC processor. I suspect this may
> not be a problem with the IA64 (Itanium) processors. This especially
> shows up if you use the gcc compiler. Fortunately I have access to
> Rational Quantify, a very powerful profiler which led me down to just
> a few lines of assembly code causing almost the whole delay.
>
> I finally have an ssh/sshd executable under HP which logs in almost
> instantaneously. I wouldn't consider this a complete solution yet,
> especially if you don't have access to HP's ANSI C compiler, and I
> haven't thoroughly tested this whole configuration. But this
> information may still prove quite useful.
>
> I'm using the latest of everything...
>
> OpenSSH 3.4p1
> OpenSSL 0.9.7 Beta 2
> libz 1.1.4
> gcc 3.1 (using gas from binutils 2.12.1)
> HP ANSI C compiler (version B.11.01.06)
>
> Although this is a 64-bit OS, I'm compiling everything in 32-bit mode.
>
> I'm running on a 9000/L2000-44 under HP-UX 11.0. This is a
> two-processor 440MHz PA-RISC 2.0 system. If you only have a PA-RISC
> 1.x processor, I think you may still be out of luck. You can check
> your processor version by running the command "getconf CPU_VERSION".
> If it returns 532 or higher you have a 2.0 processor.
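
The CPU check above is easy to script. A minimal sketch, taking the 532 threshold straight from the message; is_parisc2 and CPU_VERSION_OVERRIDE are names made up for this example, and on anything other than HP-UX getconf will most likely not know CPU_VERSION at all:

```shell
# is_parisc2 VALUE: succeed if VALUE meets the 532 threshold quoted above.
# A non-numeric or empty VALUE counts as "no".
is_parisc2() {
    [ "$1" -ge 532 ] 2>/dev/null
}

# CPU_VERSION_OVERRIDE is a hypothetical knob for trying this off HP-UX.
cpu_version=${CPU_VERSION_OVERRIDE:-$(getconf CPU_VERSION 2>/dev/null || true)}

if is_parisc2 "$cpu_version"; then
    echo "CPU_VERSION=$cpu_version: meets the 2.0 threshold"
else
    echo "CPU_VERSION=$cpu_version: below the 2.0 threshold or not available"
fi
```
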
>
> There are basically two extremely slow routines in OpenSSL which show
> up if you compile it "out of the box": RSA operations and DH parameter
> generation. You can test how fast these are with the following...
>
> $ openssl speed rsa # tests all RSA operations
> $ openssl dhparam -text 128 # generates DH parameters (128-bit)
>
> The RSA test is pretty accurate; you can compare this with other
> systems like Linux on a PC. The DH test is unfortunately very
> random: some runs will be quick and others slow. You'll have to run
> it many times and with different bit sizes to gauge how slow it is.
> Again, comparing to a Linux box may be useful. You will almost
> definitely see the HP version being much slower than Linux/Intel (on
> Pentium3/Athlon). This is because in practice the Intel chips seem to
> have much faster integer performance, whereas the PA-RISC is much
> faster with floating point. Unfortunately for you, most crypto is
> integer based. Just to give you a comparison point, here are my numbers
> (after optimizing it as described below)...
>
>                   sign    verify    sign/s verify/s
> rsa  512 bits   0.0023s   0.0002s    432.2   5402.4
> rsa 1024 bits   0.0094s   0.0005s    106.8   2132.8
> rsa 2048 bits   0.0519s   0.0014s     19.3    690.2
> rsa 4096 bits   0.3258s   0.0049s      3.1    203.4
>
> Without my changes, even with gcc -O3, my speeds were about 100 times
> slower! The DH speed is much harder to measure, but it was definitely
> real slow with the gcc compiled version.
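
Since single dhparam runs vary so much, timing a batch of runs and looking at the spread is probably more telling than any one run. A minimal sketch; time_runs is a name made up for this example, and it assumes a date command that understands +%s (GNU date does, HP-UX's may not):

```shell
# time_runs N CMD...: run CMD N times and print the elapsed whole seconds
# for each run. Output of CMD itself is discarded.
time_runs() {
    n=$1; shift
    i=1
    while [ "$i" -le "$n" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}

# Example, on a machine with openssl installed:
# time_runs 10 openssl dhparam -text 128
```

The resolution is one second, so if every run reports 0s, try a larger bit size.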
>
> Okay, so what's going on inside the OpenSSL code? There are two small
> functions which are responsible for about 95% of the CPU clock cycles.
> These are bn_mul_add_words() in the file crypto/bn/bn_asm.c and the
> function BN_mod_word() in the file crypto/bn/bn_word.c. The first is
> responsible for the miserable RSA speeds, and the latter for the
> horrible DH speeds. I'll discuss how to speed each of these up
> separately.
>
> The bn_mul_add_words() function is by default implemented in the file
> bn_asm.c. However, neither gcc nor the HP C compiler seems to be able
> to optimize that implementation very well. As that function can be
> called thousands if not millions of times, every last clock cycle is
> extremely important. Fortunately there is some hand-crafted assembly
> code in an alternate implementation. It can be found in the OpenSSL
> distribution in the file crypto/bn/asm/pa-risc2.s. You need to use
> that file instead of the generic bn_asm.c file. However, there are
> some restrictions: that file only works with HP's assembler (not
> gas), only on PA-RISC 2.0 systems, and it is not relocatable/PIC
> (can't be used in a shared library).
>
> I haven't completely figured out OpenSSL's non-standard configure
> scripts. But it is easy enough to just assemble it yourself and then
> replace that object in the libcrypto.a library.
>
> ar d libcrypto.a bn_asm.o
> ar r libcrypto.a pa-risc2.o
> ranlib libcrypto.a
>
> Then relink the openssl executable. Rerun your RSA speed
> test; hopefully the results will be very pleasant.
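
The ar steps above can be wrapped so that the old member is only deleted when it is really in the archive. A minimal sketch using only ar t, ar d, ar r and ranlib; swap_member is a name made up for this example, and it assumes pa-risc2.o has already been assembled with HP's assembler (that step is not shown here):

```shell
# swap_member LIB OLD NEW: replace archive member OLD with object file NEW.
swap_member() {
    lib=$1 old=$2 new=$3
    [ -f "$new" ] || { echo "missing $new" >&2; return 1; }
    # Only delete OLD if it is actually listed in the archive.
    if ar t "$lib" | grep -qxF "$old"; then
        ar d "$lib" "$old" || return 1
    fi
    ar r "$lib" "$new" || return 1
    ranlib "$lib"
}

# Example, run in the directory holding libcrypto.a and pa-risc2.o:
# cp libcrypto.a libcrypto.a.orig
# swap_member libcrypto.a bn_asm.o pa-risc2.o
```

Keeping a copy of the original library makes it easy to back the change out if the relinked openssl misbehaves.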
>
>
> Now, for the Diffie-Hellman part (the primary reason for SSH
> slowness). There is no assembly version of the bn_word.c file. And
> unfortunately gcc's optimizer, even with gcc 3.1 and with -O3 and
> -march=2.0, is pretty poor. This is basically because gcc invokes
> some millicode routines to do the 64-bit modulus "%" operation. I've
> found though that HP's ANSI C compiler with the correct optimization
> arguments is able to produce some PA-RISC 2.0 specific instructions
> which make it very fast in comparison (say by 100 clock cycles).
>
> cc +O3 +ESlit +DA2.0 +DS2.0 -Ae \
> -DOPENSSL_THREADS -D_REENTRANT -DDSO_DL -DOPENSSL_NO_KRB5 \
> -I/opt/gnu/include \
> -DOPENSSL_NO_RC5 -DOPENSSL_NO_IDEA -D_REENTRANT \
> -DB_ENDIAN -DMD32_XARRAY -c bn_word.c -o bn_word.o
>
> Also throw in +Z if you're trying to make a shared library (but see
> note about pa-risc2.s file above).
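
To see why one slow "%" hurts so much, it may help to picture a multi-word remainder computed one word at a time, with one double-width "%" per word. The sketch below illustrates that schoolbook method in shell arithmetic. It is not the OpenSSL source, and it assumes BN_mod_word works along these lines. mod_words is a name made up for this example, and it assumes the shell does 64-bit arithmetic:

```shell
# mod_words W WORD...: remainder of a multi-word number modulo W.
# Words are 32-bit values, most significant first. W must be below 2^31
# so the intermediate value fits in signed 64-bit shell arithmetic.
mod_words() {
    w=$1; shift
    ret=0
    for word in "$@"; do
        # One 64-bit "%" per word of the big number.
        ret=$(( ((ret << 32) | word) % w ))
    done
    echo "$ret"
}

mod_words 97 1 0    # 2^32 mod 97, prints 35
```

If a prime search tries many small divisors against a candidate that is many words long, the "%" count multiplies quickly, so shaving cycles off each one adds up.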
>
> Except for those two files (pa-risc2.s and bn_word.c), you can use gcc
> for everything else. I've been using gcc 3.1, with -O3 -march=2.0.
>
> Now, if all goes well, you'll have a new libcrypto.a. Compile and
> link OpenSSH against that one and you should see fast logins, finally!
> Note, both the server (sshd) and the client (ssh) need to be
> recompiled/relinked, as both generate their half of the DH parameters.
>
> Deron Meranda