Re: Re: CRC - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Re: CRC
Date
Msg-id 17602.976405583@sss.pgh.pa.us
Whole thread Raw
In response to Re: Re: CRC  (Bruce Guenter <bruceg@em.ca>)
Responses Re: Re: CRC  (Bruce Guenter <bruceg@em.ca>)
Re: Re: CRC  (ncm@zembu.com (Nathan Myers))
List pgsql-hackers
Bruce Guenter <bruceg@em.ca> writes:
>> This is a lot closer than I'd have expected, but it sure ain't
>> "MD5 40% faster" as you reported.  I wonder why the difference
>> in results between your platform and mine?

> The difference is likely because PA-RISC (like most other RISC
> architectures) lack a "roll" opcode that is very prevalent in the MD5
> algorithm.

A good theory, but unfortunately not a correct theory.  PA-RISC can do a
circular shift in one cycle using the "shift right double" instruction,
with the same register specified as both high and low halves of the
64-bit operand.  And gcc does know about that.

After some groveling through assembly code, it seems that the CRC-32
implementation is about as tight as it could get: two loads, two XORs,
and two EXTRU's per byte (one used to implement the right shift, the
other to implement masking with 0xFF).  And the wall clock timing does
indeed come out to just about six cycles per byte.  The MD5 code also
looks pretty tight.  Each basic OP requires either two or three logical
operations (and/or/xor/not) depending on which round you're looking at,
plus four additions and a circular shift.  PA-RISC needs two cycles to
load an arbitrary 32-bit constant, but other than that I see no wasted
cycles here:
ldil L'-1444681467,%r20xor %r3,%r14,%r19ldo R'-1444681467(%r20),%r20and %r1,%r19,%r19addl %r15,%r20,%r20xor
%r14,%r19,%r19addl%r19,%r26,%r19addl %r20,%r19,%r15shd %r15,%r15,27,%r15addl %r15,%r3,%r15
 

Note gcc has been smart enough to assign all the correct_words[] array
elements to registers, else we'd lose another cycle to a load operation
--- fortunately PA-RISC has lots of registers.

There are 64 of these basic OPs needed in each round, and each round
processes 64 input bytes, so basically you can figure one OP per byte.
Ignoring loop overhead and so forth, it's nine or ten cycles per byte
for MD5 versus six for CRC.

I'm at a loss to see how a Pentium would arrive at a better result for
MD5 than for CRC.  For one thing, it's going to be at a disadvantage
because it hasn't got enough registers.  I'd be interested to see the
assembly code...
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Hiroshi Inoue"
Date:
Subject: RE: [COMMITTERS] pgsql/src/backend/commands (command.c vacuum.c)
Next
From: Tom Lane
Date:
Subject: Re: [COMMITTERS] pgsql/src/backend/commands (command.c vacuum.c)