Thread: What exactly is our CRC algorithm?

What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
Our CRC algorithm is a bit weird. It's supposedly CRC-32, with the same
polynomial as used in Ethernet et al, but it actually is not. The
comments refer to "Painless Guide to CRC Error Detection Algorithms" by
Ross N. Williams [1] (http://www.ross.net/crc/download/crc_v3.txt), but
I think it was implemented incorrectly.

As a test case, I used an input of a single zero byte. I calculated the
CRC using Postgres' INIT_CRC32+COMP_CRC32+FIN_CRC32, and compared with
various online CRC calculation tools and C snippets. The Postgres
algorithm produces the value 2D02EF72, while the correct one is
D202EF8D. The first and last byte are inverted. For longer inputs, the
values diverge, and I can't see any obvious pattern between the Postgres
and correct values.
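
For reference, here is a minimal bit-by-bit sketch of the standard reflected
CRC-32 (illustrative code, not the attached test program) that reproduces the
D202EF8D value above:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Reference sketch only: reflected CRC-32, polynomial 04C11DB7
     * (reflected constant 0xEDB88320), init and final XOR 0xFFFFFFFF. */
    static uint32_t
    crc32_reference(const unsigned char *buf, size_t len)
    {
        uint32_t    crc = 0xFFFFFFFF;

        while (len-- > 0)
        {
            crc ^= *buf++;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320 : 0);
        }
        return crc ^ 0xFFFFFFFF;
    }

    int
    main(void)
    {
        unsigned char zero = 0;

        printf("%08X\n", crc32_reference(&zero, 1));    /* prints D202EF8D */
        return 0;
    }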

There are many variants of CRC calculations, as explained in Ross's
guide. But ours doesn't seem to correspond to the reversed or reflected
variants either.

I compiled the code from Ross's document, and built a small test program
to test it. I used Ross's "reverse" lookup table, which is the same
table we use in Postgres. It produces this output:

Calculating CRC-32 (polynomial 04C11DB7) for a single zero byte:
D202EF8D 11010010000000101110111110001101 (simple)
2D02EF72 10101101000000101110111101110010 (lookup)
D202EF8D 11010010000000101110111110001101 (lookup reflected)

Hmm. So the simple, non-table driven, calculation gives the same result
as using the lookup table using the reflected lookup code. That's
expected; the lookup method is supposed to be the same, just faster.
However, using the "normal" lookup code, but with a "reflected" lookup
table, produces the same result as Postgres' algorithm. Indeed, that's
what we do in PostgreSQL. But AFAICS, that's an incorrect combination.
You're supposed to use the non-reflected lookup table with the non-reflected
lookup code; you can't mix and match.
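
To make that concrete, here is a sketch of the two table-driven update steps
from Ross's guide (illustrative code, not the Postgres source); the legacy
Postgres macros pair the reflected table with the non-reflected step:

    #include <stdint.h>

    /* 256-entry lookup table, assumed to be Ross's "reflected" table
     * (the same one Postgres ships). */
    extern const uint32_t crc_table[256];

    /* Reflected update step: fold each byte into the low-order bits. */
    static uint32_t
    crc_update_reflected(uint32_t crc, unsigned char byte)
    {
        return crc_table[(crc ^ byte) & 0xFF] ^ (crc >> 8);
    }

    /* Non-reflected ("normal") update step: fold into the high-order bits.
     * Using this step with the reflected table is the mix-and-match
     * described above. */
    static uint32_t
    crc_update_normal(uint32_t crc, unsigned char byte)
    {
        return crc_table[((crc >> 24) ^ byte) & 0xFF] ^ (crc << 8);
    }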


As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't
correspond to any bit-by-bit CRC variant and polynomial. My math skills
are not strong enough to determine what the consequences of that are. It
might still be a decent checksum. Or not. I couldn't tell if the good
error detection properties of the normal CRC-32 polynomial apply to our
algorithm or not.

Thoughts? Attached is the test program I used for this.

- Heikki

Attachment

Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On 2014-10-08 22:13:46 +0300, Heikki Linnakangas wrote:
> Hmm. So the simple, non-table driven, calculation gives the same result as
> using the lookup table using the reflected lookup code. That's expected; the
> lookup method is supposed to be the same, just faster. However, using the
> "normal" lookup code, but with a "reflected" lookup table, produces the same
> result as Postgres' algorithm. Indeed, that's what we do in PostgreSQL. But
> AFAICS, that's an incorrect combination. You're supposed to the
> non-reflected lookup table with the non-reflected lookup code; you can't mix
> and match.
> 
> As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't
> correspond to any bit-by-bit CRC variant and polynomial. My math skills are
> not strong enough to determine what the consequences of that are. It might
> still be a decent checksum. Or not. I couldn't tell if the good error
> detection properties of the normal CRC-32 polynomial apply to our algorithm
> or not.

Additional interesting datapoints are that hstore and ltree contain the
same tables - but properly use the reflected computation.

> Thoughts?

It clearly seems like a bad idea to continue with this - I don't think
anybody here knows which guarantees this gives us.

The question is how can we move away from this. There's unfortunately
two places that embed PGC32 that are likely to prove problematic when
fixing the algorithm: pg_trgm and tsgist both seem to include crc's in
their logic in a persistent way.  I think we should provide
INIT/COMP/FIN_PG32 using the current algorithm for these.

If we're switching to a saner computation, we should imo also switch to
a better polynomial - CRC-32C has better error detection capabilities than
CRC32 and is available in hardware. As we're paying the price of
breaking compat anyway...

Arguably we could also say that given that there's been little evident
problems with the borked computation we could also switch to a much
faster hash instead of continuing to use crc...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Gavin Flower
Date:
On 09/10/14 10:13, Andres Freund wrote:
> On 2014-10-08 22:13:46 +0300, Heikki Linnakangas wrote:
>> Hmm. So the simple, non-table driven, calculation gives the same result as
>> using the lookup table using the reflected lookup code. That's expected; the
>> lookup method is supposed to be the same, just faster. However, using the
>> "normal" lookup code, but with a "reflected" lookup table, produces the same
>> result as Postgres' algorithm. Indeed, that's what we do in PostgreSQL. But
>> AFAICS, that's an incorrect combination. You're supposed to the
>> non-reflected lookup table with the non-reflected lookup code; you can't mix
>> and match.
>>
>> As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't
>> correspond to any bit-by-bit CRC variant and polynomial. My math skills are
>> not strong enough to determine what the consequences of that are. It might
>> still be a decent checksum. Or not. I couldn't tell if the good error
>> detection properties of the normal CRC-32 polynomial apply to our algorithm
>> or not.
> Additional interesting datapoints are that hstore and ltree contain the
> same tables - but properly use the reflected computation.
>
>> Thoughts?
> It clearly seems like a bad idea to continue with this - I don't think
> anybody here knows which guarantees this gives us.
>
> The question is how can we move away from this. There's unfortunately
> two places that embed PGC32 that are likely to prove problematic when
> fixing the algorithm: pg_trgm and tsgist both seem to include crc's in
> their logic in a persistent way.  I think we should provide
> INIT/COMP/FIN_PG32 using the current algorithm for these.
>
> If we're switching to a saner computation, we should imo also switch to
> a better polynom - CRC-32C has better error detection capabilities than
> CRC32 and is available in hardware. As we're paying the price pf
> breaking compat anyway...
>
> Arguably we could also say that given that there's been little evident
> problems with the borked computation we could also switch to a much
> faster hash instead of continuing to use crc...
>
> Greetings,
>
> Andres Freund
>
Could a 64 bit variant of some kind be useful as an option - in addition 
to a correct 32 bit?

Most people have 64 bit processors and storage is less constrained
nowadays, and we tend to store much larger chunks of data.


Cheers,
Gavin



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 10/09/2014 01:23 AM, Gavin Flower wrote:
> On 09/10/14 10:13, Andres Freund wrote:
>> If we're switching to a saner computation, we should imo also switch to
>> a better polynom - CRC-32C has better error detection capabilities than
>> CRC32 and is available in hardware. As we're paying the price pf
>> breaking compat anyway...
>>
>> Arguably we could also say that given that there's been little evident
>> problems with the borked computation we could also switch to a much
>> faster hash instead of continuing to use crc...
>
> Could a 64 bit variant of some kind be useful as an option - in addition
> to a correct 32 bit?

More bits allow you to detect more errors. That's the only advantage. 
I've never heard of that being a problem, so no, I don't think that's a 
good idea.

> As most people have 64 bit processors and storage is less constrained
> now-a-days, as well as we tend to store much larger chunks of data.

That's irrelevant to the CRC in the WAL. Each WAL record is CRC'd 
separately, and they tend to be very small (less than 8k typically) 
regardless of how "large chunks of data" you store in the database.

- Heikki




Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 10/09/2014 12:13 AM, Andres Freund wrote:
> On 2014-10-08 22:13:46 +0300, Heikki Linnakangas wrote:
>> As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't
>> correspond to any bit-by-bit CRC variant and polynomial. My math skills are
>> not strong enough to determine what the consequences of that are. It might
>> still be a decent checksum. Or not. I couldn't tell if the good error
>> detection properties of the normal CRC-32 polynomial apply to our algorithm
>> or not.
>
> Additional interesting datapoints are that hstore and ltree contain the
> same tables - but properly use the reflected computation.
>
>> Thoughts?
>
> It clearly seems like a bad idea to continue with this - I don't think
> anybody here knows which guarantees this gives us.
>
> The question is how can we move away from this. There's unfortunately
> two places that embed PGC32 that are likely to prove problematic when
> fixing the algorithm: pg_trgm and tsgist both seem to include crc's in
> their logic in a persistent way.  I think we should provide
> INIT/COMP/FIN_PG32 using the current algorithm for these.

Agreed, it's not worth breaking pg_upgrade for this.

> If we're switching to a saner computation, we should imo also switch to
> a better polynom - CRC-32C has better error detection capabilities than
> CRC32 and is available in hardware. As we're paying the price pf
> breaking compat anyway...

Agreed.

> Arguably we could also say that given that there's been little evident
> problems with the borked computation we could also switch to a much
> faster hash instead of continuing to use crc...

I don't feel like taking the leap. Once we switch to slice-by-4/8 and/or
use a hardware instruction when available, CRC is fast enough.

I came up with the attached patches. They do three things:

1. Get rid of the 64-bit CRC code. It's not used for anything, and
hasn't been for years, so it doesn't seem worth spending any effort to
fix it.

2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't
need to remain compatible across major versions.

3. Use the same lookup table for hstore and ltree, as used for the
legacy "almost CRC-32" algorithm. The tables are identical, so might as
well.

Any objections?

- Heikki


Attachment

Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 10/27/2014 06:02 PM, Heikki Linnakangas wrote:
> I came up with the attached patches. They do three things:
>
> 1. Get rid of the 64-bit CRC code. It's not used for anything, and
> haven't been for years, so it doesn't seem worth spending any effort to
> fix them.
>
> 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't
> need to remain compatible across major versions.
>
> 3. Use the same lookup table for hstore and ltree, as used for the
> legacy "almost CRC-32" algorithm. The tables are identical, so might as
> well.
>
> Any objections?

I hear none, so committed with some small fixes.
- Heikki




Re: What exactly is our CRC algorithm?

From
Robert Haas
Date:
On Tue, Nov 4, 2014 at 4:47 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I hear none, so committed with some small fixes.

Are you going to get the slice-by-N stuff working next, to speed this up?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On 2014-11-04 08:21:13 -0500, Robert Haas wrote:
> On Tue, Nov 4, 2014 at 4:47 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > I hear none, so committed with some small fixes.
> 
> Are you going to get the slice-by-N stuff working next, to speed this up?

I don't plan to do anything serious with it, but I've hacked up the crc
code to use the hardware instruction. The results are quite good - crc
vanishes entirely from the profile for most workloads. It's still
visible for bulk copy, but that's pretty much it.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 11/04/2014 03:21 PM, Robert Haas wrote:
> On Tue, Nov 4, 2014 at 4:47 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> I hear none, so committed with some small fixes.
>
> Are you going to get the slice-by-N stuff working next, to speed this up?

I don't have any concrete plans, but yeah, that would be great. So 
definitely maybe.

- Heikki



Re: What exactly is our CRC algorithm?

From
Amit Kapila
Date:
On Tue, Nov 4, 2014 at 3:17 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 10/27/2014 06:02 PM, Heikki Linnakangas wrote:
>> I came up with the attached patches. They do three things:
>>
>> 1. Get rid of the 64-bit CRC code. It's not used for anything, and
>> haven't been for years, so it doesn't seem worth spending any effort to
>> fix them.
>>
>> 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't
>> need to remain compatible across major versions.

Will this change allow database created before this commit to be
started after this commit?


With Regards,
Amit Kapila.

Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 11/07/2014 07:08 AM, Amit Kapila wrote:
> On Tue, Nov 4, 2014 at 3:17 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
> wrote:
>
>> On 10/27/2014 06:02 PM, Heikki Linnakangas wrote:
>>
>>> I came up with the attached patches. They do three things:
>>>
>>> 1. Get rid of the 64-bit CRC code. It's not used for anything, and
>>> haven't been for years, so it doesn't seem worth spending any effort to
>>> fix them.
>>>
>>> 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't
>>> need to remain compatible across major versions.
>>>
>>>
> Will this change allow database created before this commit to be
> started after this commit?

No. You could use pg_resetxlog to fix the WAL, but I think at least 
relmap files would still prevent you from starting up. You could use 
pg_upgrade.

- Heikki



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-11-04 14:40:36 +0100, andres@2ndquadrant.com wrote:
>
> On 2014-11-04 08:21:13 -0500, Robert Haas wrote:
> > 
> > Are you going to get the slice-by-N stuff working next, to speed
> > this up?
> 
> I don't plan to do anything serious with it, but I've hacked up the
> crc code to use the hardware instruction.

I'm working on this (first speeding up the default calculation using
slice-by-N, then adding support for the SSE4.2 CRC instruction on top).

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Robert Haas
Date:
On Tue, Nov 11, 2014 at 6:26 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
> At 2014-11-04 14:40:36 +0100, andres@2ndquadrant.com wrote:
>> On 2014-11-04 08:21:13 -0500, Robert Haas wrote:
>> > Are you going to get the slice-by-N stuff working next, to speed
>> > this up?
>>
>> I don't plan to do anything serious with it, but I've hacked up the
>> crc code to use the hardware instruction.
>
> I'm working on this (first speeding up the default calculation using
> slice-by-N, then adding support for the SSE4.2 CRC instruction on top).

Great!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-11-11 16:56:00 +0530, ams@2ndQuadrant.com wrote:
>
> I'm working on this (first speeding up the default calculation using
> slice-by-N, then adding support for the SSE4.2 CRC instruction on
> top).

I've done the first part in the attached patch, and I'm working on the
second (especially the bits to issue CPUID at startup and decide which
implementation to use).
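
For reference, the core of the slice-by-8 calculation looks roughly like this
(a simplified little-endian sketch, not the attached patch; the eight
256-entry lookup tables are assumed to be generated elsewhere):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    extern const uint32_t pg_crc32c_table[8][256];  /* assumed tables */

    static uint32_t
    crc32c_sb8(uint32_t crc, const void *data, size_t len)
    {
        const unsigned char *p = data;

        /* Consume 8 input bytes per iteration via 8 table lookups. */
        while (len >= 8)
        {
            uint32_t    a, b;

            memcpy(&a, p, 4);
            memcpy(&b, p + 4, 4);
            a ^= crc;
            crc = pg_crc32c_table[7][a & 0xFF] ^
                  pg_crc32c_table[6][(a >> 8) & 0xFF] ^
                  pg_crc32c_table[5][(a >> 16) & 0xFF] ^
                  pg_crc32c_table[4][a >> 24] ^
                  pg_crc32c_table[3][b & 0xFF] ^
                  pg_crc32c_table[2][(b >> 8) & 0xFF] ^
                  pg_crc32c_table[1][(b >> 16) & 0xFF] ^
                  pg_crc32c_table[0][b >> 24];
            p += 8;
            len -= 8;
        }

        /* Remaining bytes, one at a time. */
        while (len-- > 0)
            crc = pg_crc32c_table[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);

        return crc;
    }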

As a benchmark, I ran pg_xlogdump --stats against 11GB of WAL data (674
segments) generated by running a total of 2M pgbench transactions on a
db initialised with scale factor 25. The tests were run on my i5-3230
CPU, and the code in each case was compiled with "-O3 -msse4.2" (and
without --enable-debug). The profile was dominated by the CRC
calculation in ValidXLogRecord.

With HEAD's CRC code:

bin/pg_xlogdump --stats wal/000000010000000000000001   29.81s user 3.56s system 77% cpu 43.274 total
bin/pg_xlogdump --stats wal/000000010000000000000001   29.59s user 3.85s system 75% cpu 44.227 total

With slice-by-4 (a minor variant of the attached patch; the results are
included only for curiosity's sake, but I can post the code if needed):

bin/pg_xlogdump --stats wal/000000010000000000000001   13.52s user 3.82s system 48% cpu 35.808 total
bin/pg_xlogdump --stats wal/000000010000000000000001   13.34s user 3.96s system 48% cpu 35.834 total

With slice-by-8 (i.e. the attached patch):

bin/pg_xlogdump --stats wal/000000010000000000000001   7.88s user 3.96s system 34% cpu 34.414 total
bin/pg_xlogdump --stats wal/000000010000000000000001   7.85s user 4.10s system 34% cpu 35.001 total

(Note the progressive reduction in user time from ~29s to ~8s.)

Finally, just for comparison, here's what happens when we use the
hardware instruction via gcc's __builtin_ia32_crc32xx intrinsics
(i.e. the additional patch I'm working on):

bin/pg_xlogdump --stats wal/000000010000000000000001   3.33s user 4.79s system 23% cpu 34.832 total

There are a number of potential micro-optimisations, I just wanted to
submit the obvious thing first and explore more possibilities later.

-- Abhijit

Attachment

Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 11/19/2014 05:58 PM, Abhijit Menon-Sen wrote:
> At 2014-11-11 16:56:00 +0530, ams@2ndQuadrant.com wrote:
>>
>> I'm working on this (first speeding up the default calculation using
>> slice-by-N, then adding support for the SSE4.2 CRC instruction on
>> top).
>
> I've done the first part in the attached patch, and I'm working on the
> second (especially the bits to issue CPUID at startup and decide which
> implementation to use).
>
> As a benchmark, I ran pg_xlogdump --stats against 11GB of WAL data (674
> segments) generated by running a total of 2M pgbench transactions on a
> db initialised with scale factor 25.

That's an interesting choice of workload. That sure is heavy on the CRC 
calculation, but the speed of pg_xlogdump hardly matters in real life.

> With HEAD's CRC code:
>
> bin/pg_xlogdump --stats wal/000000010000000000000001   29.81s user 3.56s system 77% cpu 43.274 total
> bin/pg_xlogdump --stats wal/000000010000000000000001   29.59s user 3.85s system 75% cpu 44.227 total
>
> With slice-by-4 (a minor variant of the attached patch; the results are
> included only for curiosity's sake, but I can post the code if needed):
>
> bin/pg_xlogdump --stats wal/000000010000000000000001   13.52s user 3.82s system 48% cpu 35.808 total
> bin/pg_xlogdump --stats wal/000000010000000000000001   13.34s user 3.96s system 48% cpu 35.834 total
>
> With slice-by-8 (i.e. the attached patch):
>
> bin/pg_xlogdump --stats wal/000000010000000000000001   7.88s user 3.96s system 34% cpu 34.414 total
> bin/pg_xlogdump --stats wal/000000010000000000000001   7.85s user 4.10s system 34% cpu 35.001 total
>
> (Note the progressive reduction in user time from ~29s to ~8s.)
>
> Finally, just for comparison, here's what happens when we use the
> hardware instruction via gcc's __builtin_ia32_crc32xx intrinsics
> (i.e. the additional patch I'm working on):
>
> bin/pg_xlogdump --stats wal/000000010000000000000001   3.33s user 4.79s system 23% cpu 34.832 total

Impressive!

It would be good to see separate benchmarks on WAL generation, and WAL 
replay. pg_xlogdump is probably close to WAL replay, but the WAL 
generation case is somewhat different, as WAL is generated in small 
pieces, and each piece is accumulated to the CRC separately. The 
slice-by-X algorithm might be less effective in that scenario. I have no 
doubt that it's still a lot faster than the byte-at-a-time operation, 
but would be nice to have numbers on it.

- Heikki




Re: What exactly is our CRC algorithm?

From
Robert Haas
Date:
On Wed, Nov 19, 2014 at 11:44 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> That's an interesting choice of workload. That sure is heavy on the CRC
> calculation, but the speed of pg_xlogdump hardly matters in real life.

But isn't a workload that is heavy on CRC calculation exactly what we
want here?  That way we can see clearly how much benefit we're getting
on that particular part of the computation.  It'll still speed up
other workloads, too, just not as much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: What exactly is our CRC algorithm?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Nov 19, 2014 at 11:44 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> That's an interesting choice of workload. That sure is heavy on the CRC
>> calculation, but the speed of pg_xlogdump hardly matters in real life.

> But isn't a workload that is heavy on CRC calculation exactly what we
> want here?  That way we can see clearly how much benefit we're getting
> on that particular part of the computation.  It'll still speed up
> other workloads, too, just not as much.

Heikki's point is that it's an unrealistic choice of CRC chunk size.
Maybe that doesn't matter very much, but it's unproven.
        regards, tom lane



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 11/19/2014 06:49 PM, Robert Haas wrote:
> On Wed, Nov 19, 2014 at 11:44 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> That's an interesting choice of workload. That sure is heavy on the CRC
>> calculation, but the speed of pg_xlogdump hardly matters in real life.
>
> But isn't a workload that is heavy on CRC calculation exactly what we
> want here?  That way we can see clearly how much benefit we're getting
> on that particular part of the computation.  It'll still speed up
> other workloads, too, just not as much.

Sure. But pg_xlogdump's way of using the CRC isn't necessarily 
representative of how the backend uses it. It's probably pretty close to 
WAL replay in the server, but even there the server might be hurt more 
by the extra cache used by the lookup tables. And a backend generating 
the WAL computes the CRC on smaller pieces than pg_xlogdump and WAL redo 
does.

That said, the speedup is so large that I'm sure this is a big win in 
the server too, despite those factors.

- Heikki



Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On 2014-11-19 19:12:22 +0200, Heikki Linnakangas wrote:
> On 11/19/2014 06:49 PM, Robert Haas wrote:
> >On Wed, Nov 19, 2014 at 11:44 AM, Heikki Linnakangas
> ><hlinnakangas@vmware.com> wrote:
> >>That's an interesting choice of workload. That sure is heavy on the CRC
> >>calculation, but the speed of pg_xlogdump hardly matters in real life.
> >
> >But isn't a workload that is heavy on CRC calculation exactly what we
> >want here?  That way we can see clearly how much benefit we're getting
> >on that particular part of the computation.  It'll still speed up
> >other workloads, too, just not as much.
> 
> Sure. But pg_xlogdump's way of using the CRC isn't necessarily
> representative of how the backend uses it. It's probably pretty close to WAL
> replay in the server, but even there the server might be hurt more by the
> extra cache used by the lookup tables. And a backend generating the WAL
> computes the CRC on smaller pieces than pg_xlogdump and WAL redo does.

Right. Although it hugely depends on the checkpoint settings - if
there's many FPWs it doesn't matter much.

Obviously it won't be a fourfold performance improvement in the server,
but given the profiles I've seen in the past I'm pretty sure it'll be a
benefit.

> That said, the speedup is so large that I'm sure this is a big win in the
> server too, despite those factors.

Yep. I've done some fast and loose benchmarking in the past and it was
quite noticeable. Made XLogInsert() nearly entirely drop from profiles.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-11-19 19:12:22 +0200, hlinnakangas@vmware.com wrote:
>
> But pg_xlogdump's way of using the CRC isn't necessarily
> representative of how the backend uses it. It's probably pretty close
> to WAL replay in the server, but even there the server might be hurt
> more by the extra cache used by the lookup tables.

Sure. As Robert said, my initial benchmark was designed to show the CRC
improvements in isolation. I would be happy to conduct other tests and
post the numbers.

If I understand correctly, I need to demonstrate two things that are
"probably fine", but we don't have proof of:

(a) that the improvements in pg_xlogdump performance translate to an
    improvement in the server when reading WAL.

(b) that the slice-by-8 code doesn't hurt performance for writing WAL.

To address (a), I am timing a standby restoring the same 11GB of WAL via
restore_command with and without the CRC patch. My earlier tests showed
that this time can vary quite a bit between runs even with no changes,
but I expect to see an improvement anyway.

Suggestions for how to address (b) are welcome.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-11-20 09:52:01 +0530, ams@2ndQuadrant.com wrote:
>
> To address (a), I am timing a standby restoring the same 11GB of WAL
> via restore_command with and without the CRC patch.

I ran "perf record -F 100 --call-graph=dwarf bin/postgres -D backup",
where:

(a) bin/postgres was compiled, as before, with CFLAGS="-O3 -msse4.2" and
    without --enable-debug.

(b) backup is a basebackup of the original data directory before I ran
    pgbench; it has a recovery.conf with a restore_command to fetch the
    WAL generated during the pgbench run.

I ran the test "frice" (as my daughter would say) with HEAD and the
slice-by-8 patch, and counted the time between the first and last
«restored log file "NNN" from archive» log messages. The number in
parentheses shows the order in which I ran the tests.

HEAD: 6m47s (1), 6m23s (2), 3m12s (6), 3m04s (7)
8CRC: 5m16s (3), 3m13s (4), 2m57s (5), 2m46s (8)

In the unpatched server, the profile shows ValidXLogRecord at >50% in
every case (54.81%, 59.41%, 58.82%, 59.05%).

In the patched server, pg_comp_crc32c was at the top of the profile in
three cases, with 10.29%, 12.01%, and 10.99%. In test (4), however, it
was at 6.38%, and first place went to copy_user_enhanced_fast_string
with 10.03% (which was in second place the other three times).

I believe the wallclock runtime in these tests was influenced more by IO
than CPU, but that the profile shows a worthwhile improvement anyway. I
have the perf.data from each run, and I can rerun the tests if anyone
wants to suggest a different strategy.

> Suggestions for how to address (b) are welcome.

I'll think about how to do this, and work on the SSE4.2 patch meanwhile.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote:
>
> > Suggestions for how to address (b) are welcome.

With help from Andres, I set up a workload where XLogInsert* was at the
top of my profiles: server with fsync and synchronous_commit off, and
pgbench running a multiple-row insert into a single-text-column table
with -M prepared -c 25 -t 250000 -f script.

Unfortunately I can't see much difference despite running things with
slightly different parameters a few dozen times. For example, here are
real/user/sys times from three runs each with HEAD and slice-by-8 on an
otherwise-idle i7-3770 server with a couple of undistinguished Toshiba
7200rpm SATA disks in RAID-1:

HEAD:
    2m24.822s/0m18.776s/0m23.156s
    3m34.586s/0m18.784s/0m24.324s
    3m41.557s/0m18.640s/0m23.920s

Slice-by-8:
    2m26.977s/0m18.420s/0m22.884s
    3m36.664s/0m18.376s/0m24.232s
    3m43.930s/0m18.580s/0m24.560s

I don't know how to interpret these results (especially the tendency for
the tests to slow down as time passes, with every version). At best, it
shows that the new CRC code doesn't hurt, at worst it's just irrelevant.
Supporting the latter interpretation, using the hardware-CRC patch also
gives similar results (but XLogInsert is not at the top of the profile).

Hardware-CRC:
    2m29.090s/0m18.652s/0m23.764s
    3m30.188s/0m18.692s/0m25.332s
    3m38.110s/0m20.424s/0m24.532s

If anyone has other suggestions, I'm all ears.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 11/21/2014 12:11 PM, Abhijit Menon-Sen wrote:
> At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote:
>>
>>> Suggestions for how to address (b) are welcome.
>
> With help from Andres, I set up a workload where XLogInsert* was at the
> top of my profiles: server with fsync and synchronous_commit off, and
> pgbench running a multiple-row insert into a single-text-column table
> with -M prepared -c 25 -t 250000 -f script.

How wide is the row, in terms of bytes? You should see bigger 
improvement with wider rows, as you get longer contiguous chunks of data 
to CRC that way. With very narrow rows, you might not see much 
difference because the chunks are so small.

Did you run these tests with a fresh checkout, including the WAL format 
patch I committed yesterday? That would make some difference to how many 
XLogRecData chunks there are for each insertion record.

If that's the problem, it might be beneficial to memcpy() all the data 
to a temporary buffer, and calculate the CRC over the whole, instead of 
CRC'ing each XLogRecData chunk separately. XLogRecordAssemble() uses a 
scratch area, hdr_scratch, for building all the headers. You could check 
how much rmgr-specific data there is, and if there isn't much, just 
append the data to that scratch area too.
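
Roughly, the idea is something like this (an illustrative sketch with a
made-up type standing in for XLogRecData, not actual xlog.c code):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical stand-in for the XLogRecData chain. */
    typedef struct Chunk
    {
        struct Chunk *next;
        const char *data;
        size_t      len;
    } Chunk;

    extern uint32_t pg_comp_crc32c(uint32_t crc, const void *data, size_t len);

    /* Copy all chunks into one scratch buffer, then CRC the whole buffer
     * in a single call instead of CRC'ing each small chunk separately. */
    static uint32_t
    crc_via_scratch(uint32_t crc, const Chunk *chain,
                    char *scratch, size_t scratch_size)
    {
        size_t      used = 0;

        for (; chain != NULL; chain = chain->next)
        {
            if (used + chain->len > scratch_size)
                break;          /* a real implementation would fall back
                                 * to per-chunk CRC here */
            memcpy(scratch + used, chain->data, chain->len);
            used += chain->len;
        }
        return pg_comp_crc32c(crc, scratch, used);
    }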

> Unfortunately I can't see much difference despite running things with
> slightly different parameters a few dozen times. For example, here are
> real/user/sys times from three runs each with HEAD and slice-by-8 on an
> otherwise-idle i7-3770 server with a couple of undistinguished Toshiba
> 7200rpm SATA disks in RAID-1:
>
> HEAD:
>      2m24.822s/0m18.776s/0m23.156s
>      3m34.586s/0m18.784s/0m24.324s
>      3m41.557s/0m18.640s/0m23.920s
>
> Slice-by-8:
>      2m26.977s/0m18.420s/0m22.884s
>      3m36.664s/0m18.376s/0m24.232s
>      3m43.930s/0m18.580s/0m24.560s
>
> I don't know how to interpret these results (especially the tendency for
> the tests to slow down as time passes, with every version).

The user/sys times are very stable, but the real time is much higher and 
increases. Does that mean that it's blocked on I/O?

- Heikki




Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On 2014-11-21 13:01:50 +0200, Heikki Linnakangas wrote:
> On 11/21/2014 12:11 PM, Abhijit Menon-Sen wrote:
> >At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote:
> >>
> >>>Suggestions for how to address (b) are welcome.
> >
> >With help from Andres, I set up a workload where XLogInsert* was at the
> >top of my profiles: server with fsync and synchronous_commit off, and
> >pgbench running a multiple-row insert into a single-text-column table
> >with -M prepared -c 25 -t 250000 -f script.
> 
> How wide is the row, in terms of bytes? You should see bigger improvement
> with wider rows, as you get longer contiguous chunks of data to CRC that
> way. With very narrow rows, you might not see much difference because the
> chunks are so small.

The primary goal, as I understood it, was to test very short records to
test whether there's a regression due to slice-by-8. There doesn't seem
to be any.

It's, IIRC, trivial to reproduce significant performance benefits in
workloads with lots of FPWs or large records (like COPY), but those
weren't what you and Tom were concerned about...

> If that's the problem, it might be beneficial to memcpy() all the data to a
> temporary buffer, and calculate the CRC over the whole, instead of CRC'ing
> each XLogRecData chunk separately. XLogRecordAssemble() uses a scratch area,
> hdr_scratch, for building all the headers. You could check how much
> rmgr-specific data there is, and if there isn't much, just append the data
> to that scratch area too.

I think that might very well make sense - but it's imo something better
done separately from the smarter CRC algorithm.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 11/21/2014 01:06 PM, Andres Freund wrote:
> On 2014-11-21 13:01:50 +0200, Heikki Linnakangas wrote:
>> On 11/21/2014 12:11 PM, Abhijit Menon-Sen wrote:
>>> At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote:
>>>>
>>>>> Suggestions for how to address (b) are welcome.
>>>
>>> With help from Andres, I set up a workload where XLogInsert* was at the
>>> top of my profiles: server with fsync and synchronous_commit off, and
>>> pgbench running a multiple-row insert into a single-text-column table
>>> with -M prepared -c 25 -t 250000 -f script.
>>
>> How wide is the row, in terms of bytes? You should see bigger improvement
>> with wider rows, as you get longer contiguous chunks of data to CRC that
>> way. With very narrow rows, you might not see much difference because the
>> chunks are so small.
>
> The primary goal, as I understood it, was to test very short records to
> test whether there's a regression due to slice-by-8. There doesn't seem
> to be any.

Ah, OK. Mission accomplished, then.

> It's, IIRC, trivial to reproduce significant performance benefits in
> workloads with lots of FPWs or large records (like COPY), but those
> weren't what you and Tom were concerned about...

Yeah.

>> If that's the problem, it might be beneficial to memcpy() all the data to a
>> temporary buffer, and calculate the CRC over the whole, instead of CRC'ing
>> each XLogRecData chunk separately. XLogRecordAssemble() uses a scratch area,
>> hdr_scratch, for building all the headers. You could check how much
>> rmgr-specific data there is, and if there isn't much, just append the data
>> to that scratch area too.
>
> I think that might very well make sense - but it's imo something better
> done separately from the smarter CRC algorithm.

Agreed.

- Heikki



Re: What exactly is our CRC algorithm?

From
Ants Aasma
Date:
On Fri, Nov 21, 2014 at 12:11 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
> If anyone has other suggestions, I'm all ears.

Do you have a WIP patch I could take a look at and tweak? Maybe
there's something about the compilers code generation that could be
improved.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: What exactly is our CRC algorithm?

From
"ktm@rice.edu"
Date:
On Fri, Nov 21, 2014 at 03:41:45PM +0530, Abhijit Menon-Sen wrote:
> At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote:
> >
> > > Suggestions for how to address (b) are welcome.
> 
> With help from Andres, I set up a workload where XLogInsert* was at the
> top of my profiles: server with fsync and synchronous_commit off, and
> pgbench running a multiple-row insert into a single-text-column table
> with -M prepared -c 25 -t 250000 -f script.
> 
> Unfortunately I can't see much difference despite running things with
> slightly different parameters a few dozen times. For example, here are
> real/user/sys times from three runs each with HEAD and slice-by-8 on an
> otherwise-idle i7-3770 server with a couple of undistinguished Toshiba
> 7200rpm SATA disks in RAID-1:
> 
> HEAD:
>     2m24.822s/0m18.776s/0m23.156s
>     3m34.586s/0m18.784s/0m24.324s
>     3m41.557s/0m18.640s/0m23.920s
> 
> Slice-by-8:
>     2m26.977s/0m18.420s/0m22.884s
>     3m36.664s/0m18.376s/0m24.232s
>     3m43.930s/0m18.580s/0m24.560s
> 
> I don't know how to interpret these results (especially the tendency for
> the tests to slow down as time passes, with every version). At best, it
> shows that the new CRC code doesn't hurt, at worst it's just irrelevant.
> Supporting the latter interpretation, using the hardware-CRC patch also
> gives similar results (but XLogInsert is not at the top of the profile).
> 
> Hardware-CRC:
>     2m29.090s/0m18.652s/0m23.764s
>     3m30.188s/0m18.692s/0m25.332s
>     3m38.110s/0m20.424s/0m24.532s
> 
> If anyone has other suggestions, I'm all ears.
> 
> -- Abhijit

Hi,

This indicates that another part of the system is the resource limit,
not the CRC calculation. My money is on the I/O system. Try it using
an in memory filesystem and see if that matters. Even if it is still
the same results, at least the new version of CRC is no worse than
the current version.

Regards,
Ken



Re: What exactly is our CRC algorithm?

From
Bruce Momjian
Date:
On Fri, Nov 21, 2014 at 07:49:45AM -0600, ktm@rice.edu wrote:
> Hi,
> 
> This indicates that another part of the system is the resource limit,
> not the CRC calculation. My money is on the I/O system. Try it using
> an in memory filesystem and see if that matters. Even if it is still
> the same results, at least the new version of CRC is no worse than
> the current version.

I would test it with fsync=off to remove the fsync delay.  That will
simulate an SSD or BBU controller.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-11-26 18:56:52 -0500, bruce@momjian.us wrote:
>
> I would test it with fsync=off to remove the fsync delay.  That will
> simulate an SSD or BBU controller.

The earlier tests were run with fsync=off.

I ran another round of the replay tests on the i7 server, this time
replaying 30GB of WAL segments.

master
    16:08:03 16:16:23 08:20
    16:19:33 16:28:03 08:30
    16:32:20 16:41:17 08:57
    16:42:42 16:51:22 08:40

8crc
    16:52:11 16:59:48 07:37
    17:01:07 17:08:30 07:23
    17:22:16 17:30:11 07:55
    17:31:29 17:39:06 07:37

These are just the results of the last four runs with each branch, but
the difference was consistent over many more runs.

To summarise: the slice-by-8 patch (a) increases pure CRC performance by
a large margin, (b) has a positive effect on reading WAL during replay,
(c) doesn't hurt XLogInsert in typical cases (which was the main worry).

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Bruce Momjian
Date:
On Thu, Nov 27, 2014 at 11:19:51AM +0530, Abhijit Menon-Sen wrote:
> At 2014-11-26 18:56:52 -0500, bruce@momjian.us wrote:
> >
> > I would test it with fsync=off to remove the fsync delay.  That will
> > simulate an SSD or BBU controller.
> 
> The earlier tests were run with fsync=off.
> 
> I ran another round of the replay tests on the i7 server, this time
> replaying 30GB of WAL segments.
> 
> master
> 
>     16:08:03 16:16:23 08:20
>     16:19:33 16:28:03 08:30
>     16:32:20 16:41:17 08:57
>     16:42:42 16:51:22 08:40
> 
> 8crc
> 
>     16:52:11 16:59:48 07:37
>     17:01:07 17:08:30 07:23
>     17:22:16 17:30:11 07:55
>     17:31:29 17:39:06 07:37
> 
> These are just the results of the last four runs with each branch, but
> the difference was consistent over many more runs.
> 
> To summarise: the slice-by-8 patch (a) increases pure CRC performance by
> a large margin, (b) has a positive effect on reading WAL during replay,
> (c) doesn't hurt XLogInsert in typical cases (which was the main worry).

Thanks, these are good numbers.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
Hi.

Here's a proposed patch to use CPUID at startup to determine if the
SSE4.2 CRC instructions are available, to use them instead of the
slice-by-8 implementation (posted earlier).
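
In outline, the detection looks something like this (a sketch assuming GCC's
cpuid.h; the function-pointer arrangement follows the patch in spirit, but
the code here is illustrative):

    #include <cpuid.h>
    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

    extern uint32_t pg_comp_crc32c_sb8(uint32_t crc, const void *data, size_t len);
    extern uint32_t pg_comp_crc32c_sse(uint32_t crc, const void *data, size_t len);

    /* Function pointer, defaulting to the portable slice-by-8 code. */
    crc32c_fn   pg_comp_crc32c = pg_comp_crc32c_sb8;

    void
    pg_init_comp_crc32c(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 1: ECX bit 20 (bit_SSE4_2) indicates SSE4.2 support. */
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & bit_SSE4_2))
            pg_comp_crc32c = pg_comp_crc32c_sse;
    }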

A few notes:

1. GCC has included cpuid.h since 4.3.0, so I figured it was safe to
   use. It can be replaced with some inline assembly otherwise.

2. I've also used the crc32b/crc32q instructions directly rather than
   using ".byte" to encode the instructions; binutils versions since
   2007 or so have supported them.

3. I've included the MSVC implementation mostly as an example of how to
   extend this to different compilers/platforms. It's written according
   to the documentation for MSVC intrinsics, but I have not tested it.
   Suggestions/improvements are welcome.

-- Abhijit

Attachment

Re: What exactly is our CRC algorithm?

From
Bruce Momjian
Date:
On Thu, Dec 25, 2014 at 11:57:29AM +0530, Abhijit Menon-Sen wrote:
> Hi.
> 
> Here's a proposed patch to use CPUID at startup to determine if the
> SSE4.2 CRC instructions are available, to use them instead of the
> slice-by-8 implementation (posted earlier).
> 
> A few notes:
> 
> 1. GCC has included cpuid.h since 4.3.0, so I figured it was safe to
>    use. It can be replaced with some inline assembly otherwise.
> 
> 2. I've also used the crc32b/crc32q instructions directly rather than
>    using ".bytes" to encode the instructions; bintuils versions since
>    2007 or so have supported them.
> 
> 3. I've included the MSVC implementation mostly as an example of how to
>    extend this to different compilers/platforms. It's written according
>    to the documentation for MSVC intrinsics, but I have not tested it.
>    Suggestions/improvements are welcome.

Uh, what happens if the system is compiled on a different CPU than it is
run on?  Seems we would need a run-time CPU test.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On December 26, 2014 4:50:33 PM CET, Bruce Momjian <bruce@momjian.us> wrote:
>On Thu, Dec 25, 2014 at 11:57:29AM +0530, Abhijit Menon-Sen wrote:
>> Hi.
>> 
>> Here's a proposed patch to use CPUID at startup to determine if the
>> SSE4.2 CRC instructions are available, to use them instead of the
>> slice-by-8 implementation (posted earlier).
>> 
>> A few notes:
>> 
>> 1. GCC has included cpuid.h since 4.3.0, so I figured it was safe to
>>    use. It can be replaced with some inline assembly otherwise.
>> 
>> 2. I've also used the crc32b/crc32q instructions directly rather than
>>    using ".bytes" to encode the instructions; bintuils versions since
>>    2007 or so have supported them.
>> 
>> 3. I've included the MSVC implementation mostly as an example of how
>to
>>    extend this to different compilers/platforms. It's written
>according
>>    to the documentation for MSVC intrinsics, but I have not tested
>it.
>>    Suggestions/improvements are welcome.
>
>Uh, what happens if the system is compiled on a different CPU that it
>is
>run on?  Seems we would need a run-time CPU test.

That's the cpuid thing mentioned above.

Andres

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Bruce Momjian
Date:
On Fri, Dec 26, 2014 at 04:52:58PM +0100, Andres Freund wrote:
> On December 26, 2014 4:50:33 PM CET, Bruce Momjian <bruce@momjian.us> wrote:
> >On Thu, Dec 25, 2014 at 11:57:29AM +0530, Abhijit Menon-Sen wrote:
> >> Hi.
> >> 
> >> Here's a proposed patch to use CPUID at startup to determine if the
> >> SSE4.2 CRC instructions are available, to use them instead of the
> >> slice-by-8 implementation (posted earlier).
> >> 
> >> A few notes:
> >> 
> >> 1. GCC has included cpuid.h since 4.3.0, so I figured it was safe to
> >>    use. It can be replaced with some inline assembly otherwise.
> >> 
> >> 2. I've also used the crc32b/crc32q instructions directly rather than
> >>    using ".bytes" to encode the instructions; bintuils versions since
> >>    2007 or so have supported them.
> >> 
> >> 3. I've included the MSVC implementation mostly as an example of how
> >to
> >>    extend this to different compilers/platforms. It's written
> >according
> >>    to the documentation for MSVC intrinsics, but I have not tested
> >it.
> >>    Suggestions/improvements are welcome.
> >
> >Uh, what happens if the system is compiled on a different CPU that it
> >is
> >run on?  Seems we would need a run-time CPU test.
> 
> That's the cpuid thing mentioned above.

Oh, so cpuid is not a macro exported by the compiler but a C API.  Good.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On December 26, 2014 6:05:34 PM CET, Bruce Momjian <bruce@momjian.us> wrote:
>On Fri, Dec 26, 2014 at 04:52:58PM +0100, Andres Freund wrote:
>> On December 26, 2014 4:50:33 PM CET, Bruce Momjian <bruce@momjian.us>
>wrote:
>> >On Thu, Dec 25, 2014 at 11:57:29AM +0530, Abhijit Menon-Sen wrote:
>> >> Hi.
>> >> 
>> >> Here's a proposed patch to use CPUID at startup to determine if
>the
>> >> SSE4.2 CRC instructions are available, to use them instead of the
>> >> slice-by-8 implementation (posted earlier).
>> >> 
>> >> A few notes:
>> >> 
>> >> 1. GCC has included cpuid.h since 4.3.0, so I figured it was safe
>to
>> >>    use. It can be replaced with some inline assembly otherwise.
>> >> 
>> >> 2. I've also used the crc32b/crc32q instructions directly rather
>than
>> >>    using ".bytes" to encode the instructions; bintuils versions
>since
>> >>    2007 or so have supported them.
>> >> 
>> >> 3. I've included the MSVC implementation mostly as an example of
>how
>> >to
>> >>    extend this to different compilers/platforms. It's written
>> >according
>> >>    to the documentation for MSVC intrinsics, but I have not tested
>> >it.
>> >>    Suggestions/improvements are welcome.
>> >
>> >Uh, what happens if the system is compiled on a different CPU that
>it
>> >is
>> >run on?  Seems we would need a run-time CPU test.
>> 
>> That's the cpuid thing mentioned above.
>
>Oh, so cpuid is not a macro exported by the compiler but a C API. 
>Good

Neither, really. It's an instruction returning CPU capabilities.

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Bruce Momjian
Date:
On Fri, Dec 26, 2014 at 04:52:58PM +0100, Andres Freund wrote:
> >Uh, what if the system is compiled on a different CPU that it
> >is
> >run on?  Seems we would need a run-time CPU test.
> 
> That's the cpuid thing mentioned above.

Is this something that could potentially change the data stored on disk?
Does pg_upgrade need to check for changes in this area?  Is the
detection exposed by pg_controldata?  Could this affect running the data
directory on a different CPU?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-12-26 13:11:43 -0500, bruce@momjian.us wrote:
>
> Is this something that could potentially change the data stored on
> disk? Does pg_upgrade need to check for changes in this area?  Is the
> detection exposed by pg_controldata?  Could this affect running the
> data directory on a different CPU?

No to all.

Subsequent to Heikki's change (already in master) to use the CRC-32C
algorithm (instead of the earlier mistakenly-reflected-but-not-quite
one), both the slice-by-8 software implementation posted earlier and
the SSE4.2 CRC32* instructions will compute exactly the same value.

(Yes, I have verified this in both cases.)

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Bruce Momjian
Date:
On Fri, Dec 26, 2014 at 11:52:41PM +0530, Abhijit Menon-Sen wrote:
> At 2014-12-26 13:11:43 -0500, bruce@momjian.us wrote:
> >
> > Is this something that could potentially change the data stored on
> > disk? Does pg_upgrade need to check for changes in this area?  Is the
> > detection exposed by pg_controldata?  Could this affect running the
> > data directory on a different CPU?
> 
> No to all.
> 
> Subsequent to Heikki's change (already in master) to use the CRC-32C
> algorithm (instead of the earlier mistakenly-reflected-but-not-quite
> one), both the slice-by-8 software implementation posted earlier and
> the SSE4.2 CRC32* instructions will compute exactly the same value.
> 
> (Yes, I have verified this in both cases.)

Uh, we didn't fully change all code to use the new algorithm because
there were cases where the CRC was already stored on disk, e.g. hstore,
ltree.  I assume you are only linking into Heikki's new code and will
not change the places that use the old CRC method on disk --- just
checking.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-12-26 13:48:40 -0500, bruce@momjian.us wrote:
>
> I assume you are only linking into Heikki's new code and will not
> change the places that use the old CRC method on disk --- just
> checking.

Correct. The legacy CRC computation code used by ltree etc. is
completely unaffected by both my sets of changes.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
Hi,

On 2014-12-25 11:57:29 +0530, Abhijit Menon-Sen wrote:
> -extern pg_crc32 pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len);
> +extern void pg_init_comp_crc32c(void);

How about pg_choose_crc_impl() or something?

> +extern pg_crc32 (*pg_comp_crc32c)(pg_crc32 crc, const void *data, size_t len);
>  
>  /*
>   * CRC calculation using the CRC-32C (Castagnoli) polynomial.
> 
> diff --git a/src/port/pg_crc.c b/src/port/pg_crc.c
> index 2f9857b..6be17b0 100644
> --- a/src/port/pg_crc.c
> +++ b/src/port/pg_crc.c
> @@ -21,6 +21,13 @@
>  #include "utils/pg_crc.h"
>  #include "utils/pg_crc_tables.h"
>  
> +#if defined(HAVE_CPUID_H)
> +#include <cpuid.h>
> +#elif defined(_MSC_VER)
> +#include <intrin.h>
> +#include <nmmintrin.h>
> +#endif
> +
>  static inline uint32 bswap32(uint32 x)
>  {
>  #if defined(__GNUC__) || defined(__clang__)
> @@ -39,8 +46,8 @@ static inline uint32 bswap32(uint32 x)
>  #define cpu_to_le32(x) x
>  #endif
>  
> -pg_crc32
> -pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len)
> +static pg_crc32
> +pg_comp_crc32c_sb8(pg_crc32 crc, const void *data, size_t len)

_sb8? Unless I miss something it's not slice by 8 but rather bytewise?

>  {
>      const unsigned char *p = data;
>      const uint32 *p8;
> @@ -61,7 +68,6 @@ pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len)
>       */
>  
>      p8 = (const uint32 *) p;
> -
>      while (len >= 8)
>      {
>          uint32 a = *p8++ ^ cpu_to_le32(crc);
> @@ -101,8 +107,102 @@ pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len)
>       */
>  
>      p = (const unsigned char *) p8;
> -    while (len-- > 0)
> +    while (len > 0)
> +    {
>          crc = pg_crc32c_table[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);
> +        len--;
> +    }
> +
> +    return crc;
> +}
>  
> +static pg_crc32
> +pg_asm_crc32b(pg_crc32 crc, unsigned char data)
> +{

Should be marked inline.

> +#ifdef __GNUC__
> +    __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data));

Have you checked which version of gcc introduced named references to
input/output parameters?

>      return crc;
> +#elif defined(_MSC_VER)
> +    return _mm_crc32_u8(crc, data);
> +#else
> +#error "Don't know how to generate crc32b instruction"
> +#endif
>  }
> +
> +static pg_crc32
> +pg_asm_crc32q(uint64 crc, unsigned long long data)
> +{

inline.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-12-29 13:22:28 +0100, andres@2ndquadrant.com wrote:
>
> How about pg_choose_crc_impl() or something?

Done.

> _sb8? Unless I miss something it's not slice by 8 but rather bytewise?

This is meant to apply on top of the earlier patches I posted to
implement slice-by-8. I'll attach both here.

> Should be marked inline.

Done (and the other one too).

> > +#ifdef __GNUC__
> > +    __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data));
>
> Have you checked which version of gcc introduced named references to
> input/output parameters?

No. The documentation calls them "asmSymbolicName"s, but I can't find
the term (or various likely alternatives, e.g. symbolic_operand may be
related) either in the release notes, the changelog, or the source (on
Github). Does anyone know, or know how to find out?

Meanwhile, I have attached the two patches with the other modifications.

-- Abhijit

Attachment

Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-12-29 18:44:18 +0530, ams@2ndQuadrant.com wrote:
>
> > > +#ifdef __GNUC__
> > > +    __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data));
> > 
> > Have you checked which version of gcc introduced named references to
> > input/output parameters?

OK, here we go:
   «As of GCC version 3.1, it is also possible to specify input and
   output operands using symbolic names which can be referenced within
   the assembler code.»

GCC 3.1 was released on May 15, 2002. So it should be quite safe to use
this feature.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
Hi.

I'm re-attaching the two patches as produced by format-patch. I haven't
listed any reviewers. It's either just Andres, or maybe a lot of people.

Is anyone in a position to try out the patches on MSVC and see if they
build and work sensibly, please? (Otherwise it may be better to remove
those bits from the patch for now.)

Thanks.

-- Abhijit

Attachment

Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 12/30/2014 09:40 AM, Abhijit Menon-Sen wrote:
> Hi.
>
> I'm re-attaching the two patches as produced by format-patch. I haven't
> listed any reviewers. It's either just Andres, or maybe a lot of people.
>
> Is anyone in a position to try out the patches on MSVC and see if they
> build and work sensibly, please? (Otherwise it may be better to remove
> those bits from the patch for now.)

A couple of quick comments:

bswap32 is unused on little-endian systems. That will give a compiler 
warning.

pg_comp_crc32c_sse processes one byte at a time, until the pointer is 
4-bytes aligned. Then it processes 8 bytes at a time. So it fetches the 
8-byte chunks from only 4-byte aligned addresses. Is that intentional? 
If unaligned access performs well, why bother with the initial 
byte-at-a-time processing at all?

- Heikki




Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2014-12-30 16:05:50 +0200, hlinnakangas@vmware.com wrote:
>
> A couple of quick comments:
> 
> bswap32 is unused on on little-endian systems. That will give a
> compiler warning.

Huh. I don't get a warning, even when I add -Wunused to the build flags.
But since you mention it, it would be better to write the function thus:
    static inline uint32
    cpu_to_le32(uint32 x)
    {
    #ifndef WORDS_BIGENDIAN
        return x;
    #elif defined(__GNUC__) || defined(__clang__)
        return __builtin_bswap32(x);
    #else
        return ((x << 24) & 0xff000000) |
               ((x <<  8) & 0x00ff0000) |
               ((x >>  8) & 0x0000ff00) |
               ((x >> 24) & 0x000000ff);
    #endif
    }

> pg_comp_crc32c_sse […] fetches the 8-byte chunks from only 4-byte
> aligned addresses. Is that intentional?

Thanks for spotting that. I had meant to change the test to "& 7". But
again, now that you mention it, I'm not sure it's necessary. The CRC32*
instructions don't have the usual SSE alignment requirements, and I see
that the Linux kernel (among other implementations) process eight bytes
at a time from the start of the buffer and then process the remaining
bytes one at a time. I'll do a bit more research and post an update.

Thanks for having a look.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On 2014-12-30 21:36:19 +0530, Abhijit Menon-Sen wrote:
> Thanks for spotting that. I had meant to change the test to "& 7". But
> again, now that you mention it, I'm not sure it's necessary. The CRC32*
> instructions don't have the usual SSE alignment requirements, and I see
> that the Linux kernel (among other implementations) processes eight bytes
> at a time from the start of the buffer and then processes the remaining
> bytes one at a time. I'll do a bit more research and post an update.

I'd done some quick and dirty benchmarking when writing my initial
crc32c POC and I found that four-byte-aligned accesses were faster than
unaligned ones. But that really was just running it a couple of times, nothing more.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
Hi.

OK, here are the patches with the various suggestions applied.

I found that the alignment didn't seem to make much difference for the
CRC32* instructions, so I changed it to process (len/8)*8 bytes followed by
(len%8)*1 bytes, the way the Linux kernel does.
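
Roughly this shape; a simplified sketch rather than the exact patch
code, assuming pg_asm_crc32q()/pg_asm_crc32b() wrap the corresponding
instructions:

   pg_crc32
   pg_comp_crc32c_sse(pg_crc32 crc, const void *data, size_t len)
   {
       const unsigned char *p = data;

       /* bulk of the input: eight bytes per CRC32 instruction */
       while (len >= 8)
       {
           crc = pg_asm_crc32q(crc, *(const uint64 *) p);
           p += 8;
           len -= 8;
       }

       /* remaining 0-7 bytes, one at a time */
       while (len > 0)
       {
           crc = pg_asm_crc32b(crc, *p);
           p++;
           len--;
       }

       return crc;
   }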

-- Abhijit

Attachment

Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 01/01/2015 09:17 AM, Abhijit Menon-Sen wrote:
> Hi.
>
> OK, here are the patches with the various suggestions applied.
>
> I found that the alignment didn't seem to make much difference for the
> CRC32* instructions, so I changed it to process (len/8)*8 bytes followed by
> (len%8)*1 bytes, the way the Linux kernel does.

Ok.

In the slicing-by-8 version, I wonder if it would be better to do 
single-byte loads to c0-c7, instead of two 4-byte loads and shifts. 
4-byte loads are presumably faster than single byte loads, but then 
you'd avoid the shifts. And then you could go straight into the 
8-bytes-at-a-time loop, without the initial single-byte processing to 
get the start address aligned. (the Linux implementation doesn't do 
that, so maybe it's a bad idea, but might be worth testing..)
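
To illustrate, with schematic table names t0..t7 rather than the
patch's actual layout, the current little-endian step is essentially:

   uint32      q1,
               q2;

   /* two 4-byte loads plus shifts (this is what wants the alignment loop) */
   q1 = crc ^ *(const uint32 *) p;
   q2 = *(const uint32 *) (p + 4);
   crc = t7[q1 & 0xff] ^ t6[(q1 >> 8) & 0xff] ^
         t5[(q1 >> 16) & 0xff] ^ t4[q1 >> 24] ^
         t3[q2 & 0xff] ^ t2[(q2 >> 8) & 0xff] ^
         t1[(q2 >> 16) & 0xff] ^ t0[q2 >> 24];

versus single-byte loads, which have no alignment requirement at all:

   crc = t7[(crc & 0xff) ^ p[0]] ^ t6[((crc >> 8) & 0xff) ^ p[1]] ^
         t5[((crc >> 16) & 0xff) ^ p[2]] ^ t4[((crc >> 24) & 0xff) ^ p[3]] ^
         t3[p[4]] ^ t2[p[5]] ^ t1[p[6]] ^ t0[p[7]];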

Looking at the Linux implementation, I think it only does the bswap once 
per call, not inside the hot loop. Would it even make sense to keep the 
crc variable in different byte order, and only do the byte-swap once in 
END_CRC32() ?

The comments need some work. I note that there is no mention of the 
slicing-by-8 algorithm anywhere in the comments (in the first patch).

Instead of checking for "defined(__GNUC__) || defined(__clang__)", we 
should add an explicit configure test for __builtin_bswap32().

- Heikki




Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2015-01-02 16:46:29 +0200, hlinnakangas@vmware.com wrote:
>
> In the slicing-by-8 version, I wonder if it would be better to do
> single-byte loads to c0-c7, instead of two 4-byte loads and shifts.

Nope. I did some tests, and the sb8 code is slightly slower if I remove
the 0-7byte alignment loop, and significantly slower if I switch to one
byte loads for the whole thing. So I think we should leave that part as
it is, but:

> Would it even make sense to keep the crc variable in different byte
> order, and only do the byte-swap once in END_CRC32() ?

…this certainly does make a noticeable difference. Will investigate.

> The comments need some work. I note that there is no mention of the
> slicing-by-8 algorithm anywhere in the comments (in the first patch).

Will fix. (Unfortunately the widely cited original Intel paper about
slice-by-8 seems to have gone AWOL, but I'll find something.)

> Instead of checking for "defined(__GNUC__) || defined(__clang__)",
> we should add an explicit configure test for __builtin_bswap32().

Will do.

Thanks again.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
Hi Heikki.

I've attached two regenerated CRC patches, split up as before.

1. The slicing-by-8 patch contains numerous changes:

    a. A configure test for __builtin_bswap32
    b. A comment referencing the slicing-by-8 paper (which is behind a
       paywall, unfortunately, so I haven't even read it). Are more
       comments needed? If so, where/what kind?
    c. A byte-reversed big-endian version of the 8*256 table. In Linux,
       there's only one table that uses __constant_swab32, but for us
       it's simpler to have two tables.
    d. Thanks to (c), we can avoid the bswap32 in the hot loop.
    e. On big-endian systems, FIN_CRC32C now bswap32()s the CRC before
       finalising it. (We don't need to do this in INIT_CRC32C only
       because the initialiser is 0xFFFFFFFF.)

2. The sse4.2 patch has only some minor compile fixes.

I have built and tested both patches individually on little-endian
(amd64) and big-endian (ppc) systems. I verified that the _sse is
chosen at startup on the former, and _sb8 on the latter, and that
both implementations function correctly with respect to HEAD.

Please let me know if there's anything else I need to do.

-- Abhijit

Attachment

Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 01/09/2015 10:32 AM, Abhijit Menon-Sen wrote:
> 1. The slicing-by-8 patch contains numerous changes:

With this patch, CALC_CRC32C is no longer a pure macro, but includes a 
function call. That means that you can no longer just #include pg_crc.h 
and pg_crc_tables.h in an external program. We made that arrangement 
with two header files in 2012 [1], and added src/port/pg_crc.c which 
just #includes pg_crc_tables.h, so that the backend and frontend 
programs that use libpgport will have just one copy of the tables.

Now that there's some actual code in pg_crc.c, I think we have to just 
give up on being able to get the CRC implementation without libpgport. 
It was a nice thought, but I doubt there are any programs out there that 
would have a problem with that. Anything that wants to read the WAL 
needs xlogreader.c anyway.

But actually, we should now move pg_crc.c to src/common. It was a bit of 
a hack to have it in src/port in the first place, because it has nothing 
to do with portability, but src/common didn't exist back then. Now it does.

So I propose to move pg_crc.c to src/common, and move the tables from 
pg_crc_tables.h directly to pg_crc.c. Thoughts?

[1] 
http://www.postgresql.org/message-id/flat/CAAZKuFaNcf3=YtadkWwr8yHb+1axW2RepmQ2j8a9NNGkV7PN=w@mail.gmail.com.

- Heikki




Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On 2015-02-08 18:46:30 +0200, Heikki Linnakangas wrote:
> So I propose to move pg_crc.c to src/common, and move the tables from
> pg_crc_tables.h directly to pg_crc.c. Thoughts?

+1.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2015-02-08 18:46:30 +0200, hlinnakangas@vmware.com wrote:
>
> So I propose to move pg_crc.c to src/common, and move the tables
> from pg_crc_tables.h directly to pg_crc.c. Thoughts?

Sounds fine to me.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 02/08/2015 08:33 PM, Abhijit Menon-Sen wrote:
> At 2015-02-08 18:46:30 +0200, hlinnakangas@vmware.com wrote:
>>
>> So I propose to move pg_crc.c to src/common, and move the tables
>> from pg_crc_tables.h directly to pg_crc.c. Thoughts?
>
> Sounds fine to me.

Ok, I've committed a patch that just moves the existing code to
common/pg_crc.c. I also moved pg_crc.h from include/utils to
include/common. That'll require any external programs to change their
#include accordingly, but I think it was worth it. include/common is
clearly the correct home for that file, and the only reason to keep it
in include/utils would've been for backwards-compatibility.

Attached is a rebased version of the slicing-by-8 patch. I've made some
cosmetic changes. Most notably, I turned the bswap32() function into a
macro. Better to avoid the overhead of a function call, and it also
avoids building the function altogether on little-endian systems that
don't need it.
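
Something like this, assuming a configure-provided HAVE__BUILTIN_BSWAP32
symbol (a sketch only; note the fallback evaluates its argument more
than once):

   #ifdef HAVE__BUILTIN_BSWAP32
   #define BSWAP32(x) __builtin_bswap32(x)
   #else
   #define BSWAP32(x) (((((uint32) (x)) << 24) & 0xff000000) | \
                       ((((uint32) (x)) <<  8) & 0x00ff0000) | \
                       ((((uint32) (x)) >>  8) & 0x0000ff00) | \
                       ((((uint32) (x)) >> 24) & 0x000000ff))
   #endif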

I'll continue reviewing this.

> At 2015-01-02 16:46:29 +0200, hlinnakangas@vmware.com wrote:
>>
>> Would it even make sense to keep the crc variable in different byte
>> order, and only do the byte-swap once in END_CRC32() ?
>
> ...this certainly does make a noticeable difference. Will investigate.

Do you have access to big-endian hardware to test this on? It seems like
an obvious optimization to shave off that one instruction from the hot
loop, but if it turns out not to make any measurable difference, I'd
prefer to keep the tables in the same order on big and little endian
systems, reducing the amount of #ifdefs needed.

I tested this on my laptop by adding a BSWAP32() into the hot loop -
which is bogus on a little endian Intel system - and it seems to make
about 5% difference in a quick micro-benchmark. But it would be nice to
get some numbers from the kind of big endian systems that people run in
the real world.

- Heikki

Attachment

Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2015-02-09 12:52:41 +0200, hlinnakangas@vmware.com wrote:
>
> Ok, I've committed a patch that just moves the existing code to
> common/pg_crc.c […]

Thanks, looks good.

> Attached is a rebased version of the slicing-by-8 patch.

Looks OK too.

> Do you have access to big-endian hardware to test this on?

Yes, I tested it on a Linux/ppc system. I wasn't speculating when I said
it "does make a noticeable difference", though I'm afraid I did not keep
the timings after submitting the revised patch. The speedup was some
double-digit percentage, IIRC.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 02/09/2015 03:20 PM, Abhijit Menon-Sen wrote:
> At 2015-02-09 12:52:41 +0200, hlinnakangas@vmware.com wrote:
>> Do you have access to big-endian hardware to test this on?
>
> Yes, I tested it on a Linux/ppc system. I wasn't speculating when I said
> it "does make a noticeable difference", though I'm afraid I did not keep
> the timings after submitting the revised patch. The speedup was some
> double-digit percentage, IIRC.

Ok, that's good enough for me.

Committed with a few more edits on comments and such. I'll start looking 
at the second patch now.

- Heikki




Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 01/09/2015 10:32 AM, Abhijit Menon-Sen wrote:
> 2. The sse4.2 patch has only some minor compile fixes.
>
> I have built and tested both patches individually on little-endian
> (amd64) and big-endian (ppc) systems. I verified that the _sse is
> chosen at startup on the former, and _sb8 on the latter, and that
> both implementations function correctly with respect to HEAD.

I'd like to arrange this somewhat differently, to keep the really
low-level assembler blocks and such separate from the higher level
parts. Especially now that pg_crc.c is in src/common rather than src/port, it
doesn't seem appropriate to have assembler code in there. Also, some of
the high-level code can be reused if we later add support e.g. for the
ARMv8 CRC instructions.

I propose that we add a new header file in src/port, let's call it
crc_instructions.h. That file contains the very low-level stuff, the
pg_asm_crc32q() and pg_asm_crc32b() inline functions, which just contain
the single assembler instruction. Or the corresponding mnemonic or macro
depending on the compiler; the point of the crc_instructions.h header is
to hide the differences between compilers and architectures.

If the CRC instructions are available, crc_instructions.h defines
PG_HAVE_CRC32C_INSTRUCTIONS, as well as the pg_asm_crc32q() and
pg_asm_crc32b() macros/functions. It also defines
pg_crc32_instructions_runtime_check(), to perform the runtime check to
determine whether the instructions can be used on the current platform
(i.e. if you're running on a CPU with SSE 4.2 support). There's another
symbol PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK, which indicates
whether the runtime check is needed. That's currently always defined
when PG_HAVE_CRC32C_INSTRUCTIONS is, but conceivably you might want to
build a version that skips the runtime check altogether, and doesn't
therefore require the slicing-by-8 fallback implementation at all.
Gentoo users might like that ;-), as well as possible future
architectures that always have CRC32 instructions.
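
For the x86/GCC case the runtime check itself would just be a CPUID
probe; a sketch under the naming above, using GCC's <cpuid.h> helper,
not necessarily what the attached patch does:

   #include <cpuid.h>

   static inline bool
   pg_crc32_instructions_runtime_check(void)
   {
       unsigned int eax, ebx, ecx, edx;

       /* CPUID leaf 1: SSE 4.2 support is bit 20 of ECX. */
       if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
           return false;
       return (ecx & (1 << 20)) != 0;
   }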

Attached is a patch along those lines. I haven't done much testing, but
it demonstrates the code structure I have in mind.

A couple of remarks on your original patch, which also apply to the
attached version:

> +static inline pg_crc32
> +pg_asm_crc32b(pg_crc32 crc, unsigned char data)
> +{
> +#if defined(__GNUC__) && defined(__x86_64__)
> +    __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data));
> +    return crc;
> +#elif defined(_MSC_VER)
> +    return _mm_crc32_u8(crc, data);
> +#else
> +    /* Can't generate crc32b, but keep the compiler quiet. */
> +    return 0;
> +#endif

I believe the CRC instructions are also available in 32-bit mode, so the
check for __x86_64__ is not needed. Right? Any CPU recent enough to have
these instructions certainly supports the x86-64 instruction set, but
you might nevertheless have compiled and be running in i386 mode.

It would be nice to use GCC's builtin intrinsics, __builtin_ia32_crc32*
where available, but I guess those are only available if you build with
-msse4.2, and that would defeat the point of the runtime check.

I believe the _mm_* mnemonics would also work with the Intel compiler.

- Heikki

Attachment

Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2015-02-10 14:30:51 +0200, hlinnakangas@vmware.com wrote:
>
> I propose that we add a new header file in src/port, let's call it
> crc_instructions.h […] the point of the crc_instructions.h header
> is to hide the differences between compilers and architectures.

Moving the assembly code/compiler intrinsics to a separate header is
fine with me, but I really don't like the proposed implementation at
all. Here are some comments:

1. I don't mind moving platform-specific tests for CRC32C instructions
   to configure, but if we only define PG_HAVE_CRC32C_INSTRUCTIONS, we
   would anyway have to reproduce all that platform-specific stuff in
   the header file. To do it properly, I think we should generate the
   right version of crc_instructions.h for the platform. Otherwise,
   what's the point? Might as well have only the complicated header.

2. At this point, I think we should stick with the _sse function rather
   than a generic _hw one to "drive" the platform-specific instructions.
   The structure of that function (n/8*(8bytes)+n%8*(bytes)) is specific
   to what works best for the SSE instructions. On another platform, we
   may need something very similar, or very different.

   With the proposed structure, _hw would inevitably become #ifdef'ed
   non-trivially, e.g. to run a single-byte loop at the start to align
   the main loop, but only for certain combinations of #defines.

   I think we should consider what makes the most sense when someone
   actually wants to add support for another platform/compiler, and
   not before.

3. I dislike everything about pg_crc32_instructions_runtime_check()—the
   name, the separate PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK #define,
   the fact that you try to do the test "inline" rather than at startup…

   Maybe you're trying to avoid checking separately at startup so that a
   program that links with code from src/common doesn't need to do its
   own test. But I don't think the results are worth it.

   As for omitting the slicing-by-8 tables, if there's a more compelling
   reason to do that than "Gentoo users might like that ;-)", I think it
   should be done with a straightforward -DUSE_CRC_SSE style arrangement
   rather than figuring out that you HAVE_CRC32C_INSTRUCTIONS but don't
   NEED_RUNTIME_CHECK. If you want to build special-purpose binaries,
   just pick whichever CRC implementation you like, done.

> I believe the CRC instructions are also available in 32-bit mode, so
> the check for __x86_64__ is not needed. Right?

I'm not sure, but I guess you're right.
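
As far as I can tell, the byte and dword forms (crc32b/crc32l) encode
fine in 32-bit mode; only the qword form needs a 64-bit target. So an
i386 build could still use the instruction, feeding 8-byte chunks
through two crc32l steps. A rough, untested sketch (the helper name is
made up):

   static inline pg_crc32
   pg_asm_crc32_8bytes(pg_crc32 crc, uint64 data)
   {
   #if defined(__x86_64__)
       uint64      tmp = crc;

       __asm__("crc32q %[data], %[crc]" : [crc] "+r" (tmp) : [data] "rm" (data));
       return (pg_crc32) tmp;
   #else
       /* no crc32q in 32-bit mode; do the low and high halves separately */
       __asm__("crc32l %[data], %[crc]" : [crc] "+r" (crc) : [data] "rm" ((uint32) data));
       __asm__("crc32l %[data], %[crc]" : [crc] "+r" (crc) : [data] "rm" ((uint32) (data >> 32)));
       return crc;
   #endif
   }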

> It would be nice to use GCC's builtin intrinsics,
> __builtin_ia32_crc32* where available, but I guess those are only
> available if you build with -msse4.2, and that would defeat the
> point of the runtime check.

Exactly.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 02/11/2015 11:28 AM, Abhijit Menon-Sen wrote:
> 1. I don't mind moving platform-specific tests for CRC32C instructions
>     to configure, but if we only define PG_HAVE_CRC32C_INSTRUCTIONS, we
>     would anyway have to reproduce all that platform-specific stuff in
>     the header file. To do it properly, I think we should generate the
>     right version of crc_instructions.h for the platform. Otherwise,
>     what's the point? Might as well have only the complicated header.

I don't follow. I didn't change configure at all, compared to your patch.

> 2. At this point, I think we should stick with the _sse function rather
>     than a generic _hw one to "drive" the platform-specific instructions.
>     The structure of that function (n/8*(8bytes)+n%8*(bytes)) is specific
>     to what works best for the SSE instructions. On another platform, we
>     may need something very similar, or very different.
>
>     With the proposed structure, _hw would inevitably become #ifdef'ed
>     non-trivially, e.g. to run a single-byte loop at the start to align
>     the main loop, but only for certain combinations of #defines.
>
>     I think we should consider what makes the most sense when someone
>     actually wants to add support for another platform/compiler, and
>     not before.

Hmm, ok. The ARM CRC32C instruction is very similar to the SSE4.2 one, 
but I presume it does require alignment.

> 3. I dislike everything about pg_crc32_instructions_runtime_check()—the
>     name, the separate PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK #define,
>     the fact that you try to do the test "inline" rather than at startup…
>
>     Maybe you're trying to avoid checking separately at startup so that a
>     program that links with code from src/common doesn't need to do its
>     own test. But I don't think the results are worth it.

As a case in point, with your patch pg_xlogdump would not have used the 
CRC instruction, because it never called pg_choose_crc_impl(). "Choose 
on first use" is much more convenient than requiring every program to 
call a function.
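
What I mean by "choose on first use" is just a function pointer whose
initial value is a chooser, roughly like this (names are illustrative
only, not the patch's):

   typedef pg_crc32 (*pg_comp_crc32c_fn) (pg_crc32 crc, const void *data, size_t len);

   extern pg_crc32 pg_comp_crc32c_sb8(pg_crc32 crc, const void *data, size_t len);
   extern pg_crc32 pg_comp_crc32c_sse(pg_crc32 crc, const void *data, size_t len);

   static pg_crc32 pg_comp_crc32c_choose(pg_crc32 crc, const void *data, size_t len);

   pg_comp_crc32c_fn pg_comp_crc32c = pg_comp_crc32c_choose;

   static pg_crc32
   pg_comp_crc32c_choose(pg_crc32 crc, const void *data, size_t len)
   {
       /* First call: pick the best implementation, then delegate to it. */
       if (pg_crc32_instructions_runtime_check())
           pg_comp_crc32c = pg_comp_crc32c_sse;
       else
           pg_comp_crc32c = pg_comp_crc32c_sb8;

       return pg_comp_crc32c(crc, data, len);
   }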

>     As for omitting the slicing-by-8 tables, if there's a more compelling
>     reason to do that than "Gentoo users might like that ;-)", I think it
>     should be done with a straightforward -DUSE_CRC_SSE style arrangement
>     rather than figuring out that you HAVE_CRC32C_INSTRUCTIONS but don't
>     NEED_RUNTIME_CHECK. If you want to build special-purpose binaries,
>     just pick whichever CRC implementation you like, done.

I stopped short of actually allowing you to force the use of SSE, e.g. 
with -DUSE_CRC_SSE, but my thinking was that that would force 
PG_HAVE_CRC32C_INSTRUCTIONS to be set, and NEEDS_RUNTIME_CHECK to be 
unset. I.e. those two #defines control what code is generated, but in 
the header file. I think what you're suggesting is that we'd instead 
have two mutually exclusive #defines, something like 
"USE_CRC_SSE_ALWAYS" and "USE_CRC_SSE_WITH_RUNTIME_CHECK". It wouldn't 
be much different, but would require some places to check for both.

>> It would be nice to use GCC's builtin intrinsics,
>> __builtin_ia32_crc32* where available, but I guess those are only
>> available if you build with -msse4.2, and that would defeat the
>> point of the runtime check.
>
> Exactly.

Hmm. Perhaps we should check for the built-in functions, and if they're 
available, use them and not set NEEDS_RUNTIME_CHECK. If you compile with 
-msse4.2, the code won't run without SSE4.2 anyway (or at least isn't 
guaranteed to run).

- Heikki




Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2015-02-11 13:20:29 +0200, hlinnakangas@vmware.com wrote:
>
> I don't follow. I didn't change configure at all, compared to your
> patch.

OK, I extrapolated a little too much. Your patch didn't actually include
crc_instructions.h; from the name of the #define, I imagined you planned
to move the check to configure. But now I guess from your response that
PG_HAVE_CRC32C_INSTRUCTIONS would be #defined by the header (as you did
say it would be). So let's forget that part.

> As a case in point, with your patch pg_xlogdump would not have used
> the CRC instruction, because it never called pg_choose_crc_impl().
> "Choose on first use" is much more convenient than requiring every
> program to call a function.

I can see your point. I'm not fond of the code, but I guess it's not a
huge deal in the larger scheme of things.

> I think what you're suggesting is that we'd instead have two mutually
> exclusive #defines, something like "USE_CRC_SSE_ALWAYS" and
> "USE_CRC_SSE_WITH_RUNTIME_CHECK".

I was thinking of only one knob: USE_CRC_SSE would (try to) build the
code to use the SSE instructions without any test. If it (or another
USE_CRC_XXX flag) isn't set, we'd build all applicable variants and
test at runtime to decide what to use.

Just give me a little while to think through what that would look like,
I'll post a patch.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 02/11/2015 04:20 PM, Abhijit Menon-Sen wrote:
> At 2015-02-11 13:20:29 +0200, hlinnakangas@vmware.com wrote:
>>
>> I don't follow. I didn't change configure at all, compared to your
>> patch.
>
> OK, I extrapolated a little too much. Your patch didn't actually include
> crc_instructions.h;

Oh, I'm sorry. Here's the complete patch with crc_instructions.h
- Heikki


Attachment

Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 02/12/2015 09:26 PM, Heikki Linnakangas wrote:
> On 02/11/2015 04:20 PM, Abhijit Menon-Sen wrote:
>> At 2015-02-11 13:20:29 +0200, hlinnakangas@vmware.com wrote:
>>>
>>> I don't follow. I didn't change configure at all, compared to your
>>> patch.
>>
>> OK, I extrapolated a little too much. Your patch didn't actually include
>> crc_instructions.h;
>
> Oh, I'm sorry. Here's the complete patch with crc_instructions.h

I was just about to commit the attached, which is the same as the
previous patch with just cosmetic comment changes, but then I realized
that this probably doesn't compile with Visual Studio 2005 or older. The
code does "#ifdef _MSC_VER", and then uses the _mm_crc32_u64 intrinsic,
but that intrinsic was added in Visual Studio 2008. I think we'll need a
version check there. Or better yet, a direct configure test to check if
the intrinsic exists - that way we get to also use it on Intel
compilers, which I believe also have the same intrinsics.

You want to write that or should I? How do you like this latest version
of the patch otherwise? You had some criticism earlier, but I had
forgotten to include the crc_instructions.h header file in that earlier
version.

- Heikki


Attachment

Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On 2015-03-25 19:18:51 +0200, Heikki Linnakangas wrote:
> I was just about to commit the attached, which is the same as the previous
> patch with just cosmetic comment changes, but then I realized that this
> probably doesn't compile with Visual Studio 2005 or older. The code does
> "#ifdef _MSC_VER", and then uses the _mm_crc32_u64 intrinsic, but that
> intrinsic was added in Visual Studio 2008. I think we'll need a version
> check there.

Good catch.

> Or better yet, a direct configure test to check if the
> intrinsic exists - that way we get to also use it on Intel compilers, which
> I believe also have the same intrinsics.

Maybe I'm missing something, but configure isn't run for msvc?


Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 03/25/2015 07:20 PM, Andres Freund wrote:
> On 2015-03-25 19:18:51 +0200, Heikki Linnakangas wrote:
>> Or better yet, a direct configure test to check if the
>> intrinsic exists - that way we get to also use it on Intel compilers, which
>> I believe also have the same intrinsics.
>
> Maybe I'm missing something, but configure isn't run for msvc?

Good point. On MSVC, we use the pre-built pg_config.h.win32 file 
instead. There are already a couple of cases like this in it:

/* Define to 1 if you have the `rint' function. */
#if (_MSC_VER >= 1800)
#define HAVE_RINT 1
#endif

I think we should do that for the CRC32 intrinsic too.
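
Something like this, next to the HAVE_RINT example (the symbol name is
only a placeholder for whatever we end up calling it):

   /* Define to 1 if you have the _mm_crc32_u8 and _mm_crc32_u64 intrinsics
      (Visual Studio 2008 and later). */
   #if (_MSC_VER >= 1500)
   #define HAVE_SSE42_CRC32_INTRINSICS 1
   #endif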

- Heikki




Re: What exactly is our CRC algorithm?

From
Petr Jelinek
Date:
On 25/03/15 18:24, Heikki Linnakangas wrote:
> On 03/25/2015 07:20 PM, Andres Freund wrote:
>> On 2015-03-25 19:18:51 +0200, Heikki Linnakangas wrote:
>>> Or better yet, a direct configure test to check if the
>>> intrinsic exists - that way we get to also use it on Intel compilers,
>>> which
>>> I believe also has the same intrinsics.
>>
>> Maybe I'm missing something, but configure isn't run for msvc?
>
> Good point. On MSVC, we use the pre-built pg_config.h.win32 file
> instead. There are already a couple of cases like this in it:
>
> /* Define to 1 if you have the `rint' function. */
> #if (_MSC_VER >= 1800)
> #define HAVE_RINT 1
> #endif
>
> I think we should do that for the CRC32 intrinsic too.
>

Yeah, 1500 being MSVC 2008. But I think there is little point in putting 
it in pg_config.h.win32 when it's only going to be used in 2 places in a 
single header file.

Do you plan to commit this otherwise as is?

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2015-03-25 19:18:51 +0200, hlinnaka@iki.fi wrote:
>
> I think we'll need a version check there. […]
> 
> You want to write that or should I?

I'm not familiar with MSVC at all, so it would be nice if you did it.

> How do you like this latest version of the patch otherwise?

I'm sorry, but I'm still not especially fond of it.

Apart from removing the startup check so that client programs can also
use the best available CRC32C implementation without jumping through
hoops, I don't feel that the other changes buy us very much.

Also, assuming that the point is that people who don't care about CRCs
deeply should nevertheless be able to produce special-purpose binaries
with only the optimal implementation included, we should probably have
some instructions about how to do that.

Thinking about what you were trying to do, I would find an arrangement
roughly like the following to be clearer to follow in terms of adding
new implementations and so on:

#if defined(USE_CRC_SSE42) || …can build SSE4.2 CRC code…
    #define HAVE_CRC_SSE42 1
    pg_crc32c_sse42() { … }
    bool sse42_crc32c_available() { … }
    pg_comp_crc32c = pg_crc32c_sse42;

#elif defined(USE_CRC_ARM) || …can build ARM CRC code…
    #define HAVE_CRC_ARM 1
    pg_crc32c_arm() { … }
    bool arm_crc32c_available() { … }
    pg_comp_crc32c = pg_crc32c_arm;

#endif

#define CRC_SELECTION 1
#if defined(USE_CRC_SSE42) || defined(USE_CRC_ARM) || …
#undef CRC_SELECTION
#endif

#ifdef CRC_SELECTION
    pg_crc32c_sb8() { … }

    pg_comp_crc32c_choose(…)
    {
        pg_comp_crc32c = pg_crc32c_sb8;

    #ifdef HAVE_CRC_SSE42
        if (sse42_crc32c_available())
            pg_comp_crc32c = pg_crc32c_sse42;
    #elif …
        …
    #endif

        return pg_comp_crc32c(…);
    }

    pg_comp_crc32c = pg_comp_crc32c_choose;
#endif

Then someone who wants to force the building of (only) the SSE4.2
implementation can build with -DUSE_CRC_SSE42. And if you turn on
USE_CRC_ARM when you can't build ARM code, it won't build. (The
HAVE_CRC_xxx #defines could also move to configure tests.)

If you don't specify any USE_CRC_xxx explicitly, then it'll build
whichever (probably) one it can, and select it at runtime if it's
available.

All that said, I do recognise that these are all relatively cosmetic
concerns, and I don't object seriously to any of it. On the contrary,
thanks for taking the time to review and work on the patch. Nobody
else has expressed an opinion, so I'll leave it to you to decide whether
to commit as-is, or if you want me to pursue the above approach instead.

In the realm of very minor nitpicking, here are a couple of points I
noticed in crc_instructions.h:

1. I think «if (pg_crc32_instructions_runtime_check())» would read
   better if the function were named crc32c_instructions_available or
   pg_crc32c_is_hw_accelerated, or something like that.

2. It documents pg_accumulate_crc32c_byte and
   pg_accumulate_crc32c_uint64, but actually defines pg_asm_crc32b and
   pg_asm_crc32q. If you update the code rather than the documentation,
   _update_ may be slightly preferable to _accumulate_, and maybe the
   suffixes should be _uint8 and _uint64.

3. The descriptions (e.g. "Add one byte to the current crc value.")
   should also probably read "Update the current crc value for the given
   byte/eight bytes of data".
 

Thanks.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 04/02/2015 12:39 PM, Abhijit Menon-Sen wrote:
> At 2015-03-25 19:18:51 +0200, hlinnaka@iki.fi wrote:
>>
>> I think we'll need a version check there. […]
>>
>> You want to write that or should I?
>
> I'm not familiar with MSVC at all, so it would be nice if you did it.

Thinking more about the configure checks, I think the current approach 
of using inline assembly on gcc is not quite right. We're only using 
inline assembly to force producing SSE 4.2 code, even when -msse4.2 is 
not used. That feels wrong.

And who's to say that the assembler supports the SSE instructions 
anyway? At least we'd need a configure check for that too. We have a 
buildfarm animal that still uses gcc 2.95.3, which was released in 2001. 
I don't have a compiler of that vintage to test with, but I assume an 
old enough assembler would not know about the crc32q instruction and 
fail to compile.

I believe the GCC way to do this would be to put the SSE4.2-specific 
code into a separate source file, and compile that file with "-msse4.2". 
And when you compile with -msse4.2, gcc actually also supports the 
_mm_crc32_u8/u64 intrinsics.

- Heikki




Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2015-04-02 17:58:23 +0300, hlinnaka@iki.fi wrote:
>
> We're only using inline assembly to force producing SSE 4.2 code, even
> when -msse4.2 is not used. That feels wrong.

Why? It feels OK to me (and to Andres, per earlier discussions about
exactly this topic). Doing it this way allows the binary to run on a
non-SSE4.2 platform (and not use the CRC instructions).

Also, -msse4.2 was added to the compiler later than support for the
instructions was added to the assembler.

> We have a buildfarm animal that still uses gcc 2.95.3, which was
> released in 2001. I don't have a compiler of that vintage to test
> with, but I assume an old enough assembler would not know about the
> crc32q instruction and fail to compile.

GCC from <2002 wouldn't support the symbolic operand names in inline
assembly. binutils from <2007 (IIRC) wouldn't support the assembler
instructions themselves.

We could work around the latter by using the appropriate sequence of
bytes. We could work around the former by using the old syntax for
operands.

> I believe the GCC way to do this would be to put the SSE4.2-specific
> code into a separate source file, and compile that file with
> "-msse4.2". And when you compile with -msse4.2, gcc actually also
> supports the _mm_crc32_u8/u64 intrinsics.

I have no objection to this.

Building only that file with -msse4.2 would resolve the problem of the
output binary requiring SSE4.2; and the only compilers to be excluded
are old enough to be uninteresting at least to me personally.

Have you done/are you doing this already, or do you want me to? I could
use advice on how to add build flags to only one file, since I don't
know of any precedent for that.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Andres Freund
Date:
On 2015-04-02 20:57:24 +0530, Abhijit Menon-Sen wrote:
> At 2015-04-02 17:58:23 +0300, hlinnaka@iki.fi wrote:
> >
> > We're only using inline assembly to force producing SSE 4.2 code, even
> > when -msse4.2 is not used. That feels wrong.
> 
> Why? It feels OK to me (and to Andres, per earlier discussions about
> exactly this topic). Doing it this way allows the binary to run on a
> non-SSE4.2 platform (and not use the CRC instructions).

Right. And SSE4.2 isn't that widespread yet.

> > I believe the GCC way to do this would be to put the SSE4.2-specific
> > code into a separate source file, and compile that file with
> > "-msse4.2". And when you compile with -msse4.2, gcc actually also
> > supports the _mm_crc32_u8/u64 intrinsics.

To me this seems like a somewhat pointless exercise. I actually think
from a performance POV it's better to have all the functions in one
source file, so the compiler can inline things into the trampoline if it
feels like it.

Greetings,

Andres Freund



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 04/02/2015 06:27 PM, Abhijit Menon-Sen wrote:
> At 2015-04-02 17:58:23 +0300, hlinnaka@iki.fi wrote:
>>
>> We're only using inline assembly to force producing SSE 4.2 code, even
>> when -msse4.2 is not used. That feels wrong.
>
> Why? It feels OK to me (and to Andres, per earlier discussions about
> exactly this topic). Doing it this way allows the binary to run on a
> non-SSE4.2 platform (and not use the CRC instructions).

Being able to run on non-SSE4.2 platforms is required, for sure.

> Also, -msse4.2 was added to the compiler later than support for the
> instructions was added to the assembler.

It was added in gcc 4.2. That's good enough for me.

>> We have a buildfarm animal that still uses gcc 2.95.3, which was
>> released in 2001. I don't have a compiler of that vintage to test
>> with, but I assume an old enough assembler would not know about the
>> crc32q instruction and fail to compile.
>
> GCC from <2002 wouldn't support the symbolic operand names in inline
> assembly. binutils from <2007 (IIRC) wouldn't support the assembler
> instructions themselves.
>
> We could work around the latter by using the appropriate sequence of
> bytes. We could work around the former by using the old syntax for
> operands.

I'm OK with not supporting the new instructions when building with an
old compiler/assembler. But the build shouldn't fail with an old
compiler/assembler. Using old syntax or raw bytes just to avoid failing
on an ancient compiler seems ugly.

>> I believe the GCC way to do this would be to put the SSE4.2-specific
>> code into a separate source file, and compile that file with
>> "-msse4.2". And when you compile with -msse4.2, gcc actually also
>> supports the _mm_crc32_u8/u64 intrinsics.
>
> I have no objection to this.
>
> Building only that file with -msse4.2 would resolve the problem of the
> output binary requiring SSE4.2; and the only compilers to be excluded
> are old enough to be uninteresting at least to me personally.
>
> Have you done/are you doing this already, or do you want me to? I could
> use advice on how to add build flags to only one file, since I don't
> know of any precendent for that.

I came up with the attached. The SSE4.2 specific code is now in a
separate file, in src/port/pg_crc32c_sse42.c. The slicing-by-8
implementation is moved to src/port/pg_crc32c_sb8.c, and the function to
choose the implementation at runtime is in src/port/pg_crc32c_choose.c.
How does this look to you?

BTW, we might want to move the "traditional" and "legacy" crc32
implementations out of src/common. They are only used in backend code now.

- Heikki


Attachment

Re: What exactly is our CRC algorithm?

From
Robert Haas
Date:
On Thu, Apr 2, 2015 at 5:33 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> It was added in gcc 4.2. That's good enough for me.

I think it's fine to have optional optimizations that require gcc >=
4.2, as long as older platforms don't break outright.

>>> We have a buildfarm animal that still uses gcc 2.95.3, which was
>>> released in 2001. I don't have a compiler of that vintage to test
>>> with, but I assume an old enough assembler would not know about the
>>> crc32q instruction and fail to compile.
>>
>>
>> GCC from <2002 wouldn't support the symbolic operand names in inline
>> assembly. binutils from <2007 (IIRC) wouldn't support the assembler
>> instructions themselves.
>>
>> We could work around the latter by using the appropriate sequence of
>> bytes. We could work around the former by using the old syntax for
>> operands.
>
> I'm OK with not supporting the new instructions when building with an old
> compiler/assembler. But the build shouldn't fail with an old
> compiler/assembler. Using old syntax or raw bytes just to avoid failing on
> an ancient compiler seems ugly.

I dunno about old syntax, but raw bytes seems like a bad idea, for sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: What exactly is our CRC algorithm?

From
Abhijit Menon-Sen
Date:
At 2015-04-03 00:33:10 +0300, hlinnaka@iki.fi wrote:
>
> I came up with the attached.

I like it very much.

src/port/Makefile has (note src/srv):
   +# pg_crc32c_sse42.o and its _src.o version need CFLAGS_SSE42
   +pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42)
   +pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_SSE42)
 

Other than that, this looks great. Thank you.

> BTW, we might want to move the "traditional" and "legacy" crc32
> implementations out of src/common. They are only used in backend
> code now.

I agree.

-- Abhijit



Re: What exactly is our CRC algorithm?

From
Heikki Linnakangas
Date:
On 04/03/2015 05:28 AM, Abhijit Menon-Sen wrote:
> At 2015-04-03 00:33:10 +0300, hlinnaka@iki.fi wrote:
>>
>> I came up with the attached.
>
> I like it very much.
>
> src/port/Makefile has (note src/srv):
>
>      +# pg_crc32c_sse42.o and its _src.o version need CFLAGS_SSE42
>      +pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42)
>      +pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_SSE42)
>
> Other than that, this looks great. Thank you.
>
>> BTW, we might want to move the "traditional" and "legacy" crc32
>> implementations out of src/common. They are only used in backend
>> code now.
>
> I agree.

Committed this now, after some further cleanup and reorganizing the 
autoconf stuff.

- Heikki