Thread: What exactly is our CRC algorithm?
Our CRC algorithm is a bit weird. It's supposedly CRC-32, with the same polynomial as used in Ethernet et al, but it actually is not. The comments refer to "A Painless Guide to CRC Error Detection Algorithms" by Ross N. Williams [1] (http://www.ross.net/crc/download/crc_v3.txt), but I think it was implemented incorrectly.

As a test case, I used an input of a single zero byte. I calculated the CRC using Postgres' INIT_CRC32 + COMP_CRC32 + FIN_CRC32, and compared it with various online CRC calculation tools and C snippets. The Postgres algorithm produces the value 2D02EF72, while the correct one is D202EF8D. The first and last bytes are inverted. For longer inputs, the values diverge, and I can't see any obvious pattern between the Postgres and correct values.

There are many variants of CRC calculations, as explained in Ross's guide, but ours doesn't seem to correspond to the reversed or reflected variants either. I compiled the code from Ross's document and built a small test program to test it. I used Ross's "reverse" lookup table, which is the same table we use in Postgres. It produces this output:

Calculating CRC-32 (polynomial 04C11DB7) for a single zero byte:
D202EF8D 11010010000000101110111110001101 (simple)
2D02EF72 00101101000000101110111101110010 (lookup)
D202EF8D 11010010000000101110111110001101 (lookup reflected)

Hmm. So the simple, non-table-driven calculation gives the same result as using the lookup table with the reflected lookup code. That's expected; the lookup method is supposed to be the same, just faster. However, using the "normal" lookup code with a "reflected" lookup table produces the same result as Postgres' algorithm. Indeed, that's what we do in PostgreSQL. But AFAICS, that's an incorrect combination: you're supposed to use the non-reflected lookup table with the non-reflected lookup code; you can't mix and match.

As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't correspond to any bit-by-bit CRC variant and polynomial.
My math skills are not strong enough to determine what the consequences of that are. It might still be a decent checksum. Or not. I couldn't tell if the good error detection properties of the normal CRC-32 polynomial apply to our algorithm or not. Thoughts? Attached is the test program I used for this. - Heikki
Attachment
On 2014-10-08 22:13:46 +0300, Heikki Linnakangas wrote: > Hmm. So the simple, non-table driven, calculation gives the same result as > using the lookup table using the reflected lookup code. That's expected; the > lookup method is supposed to be the same, just faster. However, using the > "normal" lookup code, but with a "reflected" lookup table, produces the same > result as Postgres' algorithm. Indeed, that's what we do in PostgreSQL. But > AFAICS, that's an incorrect combination. You're supposed to use the > non-reflected lookup table with the non-reflected lookup code; you can't mix > and match. > > As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't > correspond to any bit-by-bit CRC variant and polynomial. My math skills are > not strong enough to determine what the consequences of that are. It might > still be a decent checksum. Or not. I couldn't tell if the good error > detection properties of the normal CRC-32 polynomial apply to our algorithm > or not. Additional interesting datapoints are that hstore and ltree contain the same tables - but properly use the reflected computation. > Thoughts? It clearly seems like a bad idea to continue with this - I don't think anybody here knows what guarantees this gives us. The question is how we can move away from this. There are unfortunately two places that embed PGC32 that are likely to prove problematic when fixing the algorithm: pg_trgm and tsgist both seem to include CRCs in their logic in a persistent way. I think we should provide INIT/COMP/FIN_PG32 using the current algorithm for these. If we're switching to a saner computation, we should imo also switch to a better polynomial - CRC-32C has better error detection capabilities than CRC32 and is available in hardware. As we're paying the price of breaking compat anyway...
Arguably we could also say that, given that there have been few evident problems with the borked computation, we could switch to a much faster hash instead of continuing to use CRC... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 09/10/14 10:13, Andres Freund wrote: > On 2014-10-08 22:13:46 +0300, Heikki Linnakangas wrote: >> Hmm. So the simple, non-table driven, calculation gives the same result as >> using the lookup table using the reflected lookup code. That's expected; the >> lookup method is supposed to be the same, just faster. However, using the >> "normal" lookup code, but with a "reflected" lookup table, produces the same >> result as Postgres' algorithm. Indeed, that's what we do in PostgreSQL. But >> AFAICS, that's an incorrect combination. You're supposed to the >> non-reflected lookup table with the non-reflected lookup code; you can't mix >> and match. >> >> As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't >> correspond to any bit-by-bit CRC variant and polynomial. My math skills are >> not strong enough to determine what the consequences of that are. It might >> still be a decent checksum. Or not. I couldn't tell if the good error >> detection properties of the normal CRC-32 polynomial apply to our algorithm >> or not. > Additional interesting datapoints are that hstore and ltree contain the > same tables - but properly use the reflected computation. > >> Thoughts? > It clearly seems like a bad idea to continue with this - I don't think > anybody here knows which guarantees this gives us. > > The question is how can we move away from this. There's unfortunately > two places that embed PGC32 that are likely to prove problematic when > fixing the algorithm: pg_trgm and tsgist both seem to include crc's in > their logic in a persistent way. I think we should provide > INIT/COMP/FIN_PG32 using the current algorithm for these. > > If we're switching to a saner computation, we should imo also switch to > a better polynom - CRC-32C has better error detection capabilities than > CRC32 and is available in hardware. As we're paying the price pf > breaking compat anyway... 
> > Arguably we could also say that given that there's been little evident > problems with the borked computation we could also switch to a much > faster hash instead of continuing to use crc... > > Greetings, > > Andres Freund > Could a 64-bit variant of some kind be useful as an option, in addition to a correct 32-bit one? Most people have 64-bit processors, storage is less constrained nowadays, and we tend to store much larger chunks of data. Cheers, Gavin
On 10/09/2014 01:23 AM, Gavin Flower wrote: > On 09/10/14 10:13, Andres Freund wrote: >> If we're switching to a saner computation, we should imo also switch to >> a better polynomial - CRC-32C has better error detection capabilities than >> CRC32 and is available in hardware. As we're paying the price of >> breaking compat anyway... >> >> Arguably we could also say that given that there's been little evident >> problems with the borked computation we could also switch to a much >> faster hash instead of continuing to use crc... > > Could a 64-bit variant of some kind be useful as an option - in addition > to a correct 32-bit one? More bits allow you to detect more errors. That's the only advantage. I've never heard of that being a problem, so no, I don't think that's a good idea. > As most people have 64-bit processors and storage is less constrained > nowadays, and we tend to store much larger chunks of data. That's irrelevant to the CRC in the WAL. Each WAL record is CRC'd separately, and they tend to be very small (less than 8k typically) regardless of how "large chunks of data" you store in the database. - Heikki
On 10/09/2014 12:13 AM, Andres Freund wrote: > On 2014-10-08 22:13:46 +0300, Heikki Linnakangas wrote: >> As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't >> correspond to any bit-by-bit CRC variant and polynomial. My math skills are >> not strong enough to determine what the consequences of that are. It might >> still be a decent checksum. Or not. I couldn't tell if the good error >> detection properties of the normal CRC-32 polynomial apply to our algorithm >> or not. > > Additional interesting datapoints are that hstore and ltree contain the > same tables - but properly use the reflected computation. > >> Thoughts? > > It clearly seems like a bad idea to continue with this - I don't think > anybody here knows what guarantees this gives us. > > The question is how we can move away from this. There are unfortunately > two places that embed PGC32 that are likely to prove problematic when > fixing the algorithm: pg_trgm and tsgist both seem to include CRCs in > their logic in a persistent way. I think we should provide > INIT/COMP/FIN_PG32 using the current algorithm for these. Agreed, it's not worth breaking pg_upgrade for this. > If we're switching to a saner computation, we should imo also switch to > a better polynomial - CRC-32C has better error detection capabilities than > CRC32 and is available in hardware. As we're paying the price of > breaking compat anyway... Agreed. > Arguably we could also say that given that there's been little evident > problems with the borked computation we could also switch to a much > faster hash instead of continuing to use crc... I don't feel like taking that leap. Once we switch to slice-by-4/8 and/or use a hardware instruction when available, CRC is fast enough. I came up with the attached patches. They do three things: 1. Get rid of the 64-bit CRC code. It's not used for anything, and hasn't been for years, so it doesn't seem worth spending any effort to fix it. 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't need to remain compatible across major versions. 3. Use the same lookup table for hstore and ltree as is used for the legacy "almost CRC-32" algorithm. The tables are identical, so we might as well. Any objections? - Heikki
Attachment
On 10/27/2014 06:02 PM, Heikki Linnakangas wrote: > I came up with the attached patches. They do three things: > > 1. Get rid of the 64-bit CRC code. It's not used for anything, and > hasn't been for years, so it doesn't seem worth spending any effort to > fix it. > > 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't > need to remain compatible across major versions. > > 3. Use the same lookup table for hstore and ltree as is used for the > legacy "almost CRC-32" algorithm. The tables are identical, so we might > as well. > > Any objections? I hear none, so committed with some small fixes. - Heikki
On Tue, Nov 4, 2014 at 4:47 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > I hear none, so committed with some small fixes. Are you going to get the slice-by-N stuff working next, to speed this up? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-11-04 08:21:13 -0500, Robert Haas wrote: > On Tue, Nov 4, 2014 at 4:47 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: > > I hear none, so committed with some small fixes. > > Are you going to get the slice-by-N stuff working next, to speed this up? I don't plan to do anything serious with it, but I've hacked up the crc code to use the hardware instruction. The results are quite good - crc vanishes entirely from the profile for most workloads. It's still visible for bulk copy, but that's pretty much it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 11/04/2014 03:21 PM, Robert Haas wrote: > On Tue, Nov 4, 2014 at 4:47 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> I hear none, so committed with some small fixes. > > Are you going to get the slice-by-N stuff working next, to speed this up? I don't have any concrete plans, but yeah, that would be great. So definitely maybe. - Heikki
On Tue, Nov 4, 2014 at 3:17 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 10/27/2014 06:02 PM, Heikki Linnakangas wrote:
>> I came up with the attached patches. They do three things:
>>
>> 1. Get rid of the 64-bit CRC code. It's not used for anything, and
>> hasn't been for years, so it doesn't seem worth spending any effort to
>> fix it.
>>
>> 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't
>> need to remain compatible across major versions.
Will this change allow a database created before this commit to be
started after this commit?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 11/07/2014 07:08 AM, Amit Kapila wrote: > On Tue, Nov 4, 2014 at 3:17 PM, Heikki Linnakangas <hlinnakangas@vmware.com> > wrote: > >> On 10/27/2014 06:02 PM, Heikki Linnakangas wrote: >> >>> I came up with the attached patches. They do three things: >>> >>> 1. Get rid of the 64-bit CRC code. It's not used for anything, and >>> hasn't been for years, so it doesn't seem worth spending any effort to >>> fix it. >>> >>> 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't >>> need to remain compatible across major versions. >>> >>> > Will this change allow a database created before this commit to be > started after this commit? No. You could use pg_resetxlog to fix the WAL, but I think at least the relmap files would still prevent you from starting up. You could use pg_upgrade. - Heikki
At 2014-11-04 14:40:36 +0100, andres@2ndquadrant.com wrote: > > On 2014-11-04 08:21:13 -0500, Robert Haas wrote: > > > > Are you going to get the slice-by-N stuff working next, to speed > > this up? > > I don't plan to do anything serious with it, but I've hacked up the > crc code to use the hardware instruction. I'm working on this (first speeding up the default calculation using slice-by-N, then adding support for the SSE4.2 CRC instruction on top). -- Abhijit
On Tue, Nov 11, 2014 at 6:26 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: > At 2014-11-04 14:40:36 +0100, andres@2ndquadrant.com wrote: >> On 2014-11-04 08:21:13 -0500, Robert Haas wrote: >> > Are you going to get the slice-by-N stuff working next, to speed >> > this up? >> >> I don't plan to do anything serious with it, but I've hacked up the >> crc code to use the hardware instruction. > > I'm working on this (first speeding up the default calculation using > slice-by-N, then adding support for the SSE4.2 CRC instruction on top). Great! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
At 2014-11-11 16:56:00 +0530, ams@2ndQuadrant.com wrote: > > I'm working on this (first speeding up the default calculation using > slice-by-N, then adding support for the SSE4.2 CRC instruction on > top). I've done the first part in the attached patch, and I'm working on the second (especially the bits to issue CPUID at startup and decide which implementation to use).

As a benchmark, I ran pg_xlogdump --stats against 11GB of WAL data (674 segments) generated by running a total of 2M pgbench transactions on a db initialised with scale factor 25. The tests were run on my i5-3230 CPU, and the code in each case was compiled with "-O3 -msse4.2" (and without --enable-debug). The profile was dominated by the CRC calculation in ValidXLogRecord.

With HEAD's CRC code:

bin/pg_xlogdump --stats wal/000000010000000000000001 29.81s user 3.56s system 77% cpu 43.274 total
bin/pg_xlogdump --stats wal/000000010000000000000001 29.59s user 3.85s system 75% cpu 44.227 total

With slice-by-4 (a minor variant of the attached patch; the results are included only for curiosity's sake, but I can post the code if needed):

bin/pg_xlogdump --stats wal/000000010000000000000001 13.52s user 3.82s system 48% cpu 35.808 total
bin/pg_xlogdump --stats wal/000000010000000000000001 13.34s user 3.96s system 48% cpu 35.834 total

With slice-by-8 (i.e. the attached patch):

bin/pg_xlogdump --stats wal/000000010000000000000001 7.88s user 3.96s system 34% cpu 34.414 total
bin/pg_xlogdump --stats wal/000000010000000000000001 7.85s user 4.10s system 34% cpu 35.001 total

(Note the progressive reduction in user time from ~29s to ~8s.)

Finally, just for comparison, here's what happens when we use the hardware instruction via gcc's __builtin_ia32_crc32xx intrinsics (i.e. the additional patch I'm working on):

bin/pg_xlogdump --stats wal/000000010000000000000001 3.33s user 4.79s system 23% cpu 34.832 total

There are a number of potential micro-optimisations; I just wanted to submit the obvious thing first and explore more possibilities later. -- Abhijit
Attachment
On 11/19/2014 05:58 PM, Abhijit Menon-Sen wrote: > At 2014-11-11 16:56:00 +0530, ams@2ndQuadrant.com wrote: >> >> I'm working on this (first speeding up the default calculation using >> slice-by-N, then adding support for the SSE4.2 CRC instruction on >> top). > > I've done the first part in the attached patch, and I'm working on the > second (especially the bits to issue CPUID at startup and decide which > implementation to use). > > As a benchmark, I ran pg_xlogdump --stats against 11GB of WAL data (674 > segments) generated by running a total of 2M pgbench transactions on a > db initialised with scale factor 25. That's an interesting choice of workload. That sure is heavy on the CRC calculation, but the speed of pg_xlogdump hardly matters in real life. > With HEAD's CRC code: > > bin/pg_xlogdump --stats wal/000000010000000000000001 29.81s user 3.56s system 77% cpu 43.274 total > bin/pg_xlogdump --stats wal/000000010000000000000001 29.59s user 3.85s system 75% cpu 44.227 total > > With slice-by-4 (a minor variant of the attached patch; the results are > included only for curiosity's sake, but I can post the code if needed): > > bin/pg_xlogdump --stats wal/000000010000000000000001 13.52s user 3.82s system 48% cpu 35.808 total > bin/pg_xlogdump --stats wal/000000010000000000000001 13.34s user 3.96s system 48% cpu 35.834 total > > With slice-by-8 (i.e. the attached patch): > > bin/pg_xlogdump --stats wal/000000010000000000000001 7.88s user 3.96s system 34% cpu 34.414 total > bin/pg_xlogdump --stats wal/000000010000000000000001 7.85s user 4.10s system 34% cpu 35.001 total > > (Note the progressive reduction in user time from ~29s to ~8s.) > > Finally, just for comparison, here's what happens when we use the > hardware instruction via gcc's __builtin_ia32_crc32xx intrinsics > (i.e. the additional patch I'm working on): > > bin/pg_xlogdump --stats wal/000000010000000000000001 3.33s user 4.79s system 23% cpu 34.832 total Impressive! 
It would be good to see separate benchmarks on WAL generation, and WAL replay. pg_xlogdump is probably close to WAL replay, but the WAL generation case is somewhat different, as WAL is generated in small pieces, and each piece is accumulated to the CRC separately. The slice-by-X algorithm might be less effective in that scenario. I have no doubt that it's still a lot faster than the byte-at-a-time operation, but would be nice to have numbers on it. - Heikki
On Wed, Nov 19, 2014 at 11:44 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > That's an interesting choice of workload. That sure is heavy on the CRC > calculation, but the speed of pg_xlogdump hardly matters in real life. But isn't a workload that is heavy on CRC calculation exactly what we want here? That way we can see clearly how much benefit we're getting on that particular part of the computation. It'll still speed up other workloads, too, just not as much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Nov 19, 2014 at 11:44 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> That's an interesting choice of workload. That sure is heavy on the CRC >> calculation, but the speed of pg_xlogdump hardly matters in real life. > But isn't a workload that is heavy on CRC calculation exactly what we > want here? That way we can see clearly how much benefit we're getting > on that particular part of the computation. It'll still speed up > other workloads, too, just not as much. Heikki's point is that it's an unrealistic choice of CRC chunk size. Maybe that doesn't matter very much, but it's unproven. regards, tom lane
On 11/19/2014 06:49 PM, Robert Haas wrote: > On Wed, Nov 19, 2014 at 11:44 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> That's an interesting choice of workload. That sure is heavy on the CRC >> calculation, but the speed of pg_xlogdump hardly matters in real life. > > But isn't a workload that is heavy on CRC calculation exactly what we > want here? That way we can see clearly how much benefit we're getting > on that particular part of the computation. It'll still speed up > other workloads, too, just not as much. Sure. But pg_xlogdump's way of using the CRC isn't necessarily representative of how the backend uses it. It's probably pretty close to WAL replay in the server, but even there the server might be hurt more by the extra cache used by the lookup tables. And a backend generating the WAL computes the CRC on smaller pieces than pg_xlogdump and WAL redo does. That said, the speedup is so large that I'm sure this is a big win in the server too, despite those factors. - Heikki
On 2014-11-19 19:12:22 +0200, Heikki Linnakangas wrote: > On 11/19/2014 06:49 PM, Robert Haas wrote: > >On Wed, Nov 19, 2014 at 11:44 AM, Heikki Linnakangas > ><hlinnakangas@vmware.com> wrote: > >>That's an interesting choice of workload. That sure is heavy on the CRC > >>calculation, but the speed of pg_xlogdump hardly matters in real life. > > > >But isn't a workload that is heavy on CRC calculation exactly what we > >want here? That way we can see clearly how much benefit we're getting > >on that particular part of the computation. It'll still speed up > >other workloads, too, just not as much. > > Sure. But pg_xlogdump's way of using the CRC isn't necessarily > representative of how the backend uses it. It's probably pretty close to WAL > replay in the server, but even there the server might be hurt more by the > extra cache used by the lookup tables. And a backend generating the WAL > computes the CRC on smaller pieces than pg_xlogdump and WAL redo does. Right. Although it hugely depends on the checkpoint settings - if there are many FPWs it doesn't matter much. Obviously it won't be a fourfold performance improvement in the server, but given the profiles I've seen in the past I'm pretty sure it'll be a benefit. > That said, the speedup is so large that I'm sure this is a big win in the > server too, despite those factors. Yep. I've done some fast and loose benchmarking in the past and it was quite noticeable. It made XLogInsert() nearly entirely drop from profiles. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
At 2014-11-19 19:12:22 +0200, hlinnakangas@vmware.com wrote: > > But pg_xlogdump's way of using the CRC isn't necessarily > representative of how the backend uses it. It's probably pretty close > to WAL replay in the server, but even there the server might be hurt > more by the extra cache used by the lookup tables. Sure. As Robert said, my initial benchmark was designed to show the CRC improvements in isolation. I would be happy to conduct other tests and post the numbers. If I understand correctly, I need to demonstrate two things that are "probably fine", but we don't have proof of: (a) that the improvements in pg_xlogdump performance translate to an improvement in the server when reading WAL. (b) that the slice-by-8 code doesn't hurt performance for writing WAL. To address (a), I am timing a standby restoring the same 11GB of WAL via restore_command with and without the CRC patch. My earlier tests showed that this time can vary quite a bit between runs even with no changes, but I expect to see an improvement anyway. Suggestions for how to address (b) are welcome. -- Abhijit
At 2014-11-20 09:52:01 +0530, ams@2ndQuadrant.com wrote: > > To address (a), I am timing a standby restoring the same 11GB of WAL > via restore_command with and without the CRC patch. I ran "perf record -F 100 --call-graph=dwarf bin/postgres -D backup", where: (a) bin/postgres was compiled, as before, with CFLAGS="-O3 -msse4.2" and without --enable-debug. (b) backup is a basebackup of the original data directory before I ran pgbench; it has a recovery.conf with a restore_command to fetch the WAL generated during the pgbench run. I ran the test "frice" (as my daughter would say) with HEAD and the slice-by-8 patch, and counted the time between the first and last «restored log file "NNN" from archive» log messages. The number in parentheses shows the order in which I ran the tests.

HEAD: 6m47s (1), 6m23s (2), 3m12s (6), 3m04s (7)
8CRC: 5m16s (3), 3m13s (4), 2m57s (5), 2m46s (8)

In the unpatched server, the profile shows ValidXLogRecord at >50% in every case (54.81%, 59.41%, 58.82%, 59.05%). In the patched server, pg_comp_crc32c was at the top of the profile in three cases, with 10.29%, 12.01%, and 10.99%. In test (4), however, it was at 6.38%, and first place went to copy_user_enhanced_fast_string with 10.03% (which was in second place the other three times). I believe the wallclock runtime in these tests was influenced more by IO than CPU, but that the profile shows a worthwhile improvement anyway. I have the perf.data from each run, and I can rerun the tests if anyone wants to suggest a different strategy. > Suggestions for how to address (b) are welcome. I'll think about how to do this, and work on the SSE4.2 patch meanwhile. -- Abhijit
At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote: > > > Suggestions for how to address (b) are welcome. With help from Andres, I set up a workload where XLogInsert* was at the top of my profiles: server with fsync and synchronous_commit off, and pgbench running a multiple-row insert into a single-text-column table with -M prepared -c 25 -t 250000 -f script. Unfortunately I can't see much difference despite running things with slightly different parameters a few dozen times. For example, here are real/user/sys times from three runs each with HEAD and slice-by-8 on an otherwise-idle i7-3770 server with a couple of undistinguished Toshiba 7200rpm SATA disks in RAID-1:

HEAD:
2m24.822s/0m18.776s/0m23.156s
3m34.586s/0m18.784s/0m24.324s
3m41.557s/0m18.640s/0m23.920s

Slice-by-8:
2m26.977s/0m18.420s/0m22.884s
3m36.664s/0m18.376s/0m24.232s
3m43.930s/0m18.580s/0m24.560s

I don't know how to interpret these results (especially the tendency for the tests to slow down as time passes, with every version). At best, it shows that the new CRC code doesn't hurt; at worst, it's just irrelevant. Supporting the latter interpretation, using the hardware-CRC patch also gives similar results (but XLogInsert is not at the top of the profile).

Hardware-CRC:
2m29.090s/0m18.652s/0m23.764s
3m30.188s/0m18.692s/0m25.332s
3m38.110s/0m20.424s/0m24.532s

If anyone has other suggestions, I'm all ears. -- Abhijit
On 11/21/2014 12:11 PM, Abhijit Menon-Sen wrote: > At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote: >> >>> Suggestions for how to address (b) are welcome. > > With help from Andres, I set up a workload where XLogInsert* was at the > top of my profiles: server with fsync and synchronous_commit off, and > pgbench running a multiple-row insert into a single-text-column table > with -M prepared -c 25 -t 250000 -f script. How wide is the row, in terms of bytes? You should see bigger improvement with wider rows, as you get longer contiguous chunks of data to CRC that way. With very narrow rows, you might not see much difference because the chunks are so small. Did you run these tests with a fresh checkout, including the WAL format patch I committed yesterday? That would make some difference to how many XLogRecData chunks there are for each insertion record. If that's the problem, it might be beneficial to memcpy() all the data to a temporary buffer, and calculate the CRC over the whole, instead of CRC'ing each XLogRecData chunk separately. XLogRecordAssemble() uses a scratch area, hdr_scratch, for building all the headers. You could check how much rmgr-specific data there is, and if there isn't much, just append the data to that scratch area too. > Unfortunately I can't see much difference despite running things with > slightly different parameters a few dozen times. For example, here are > real/user/sys times from three runs each with HEAD and slice-by-8 on an > otherwise-idle i7-3770 server with a couple of undistinguished Toshiba > 7200rpm SATA disks in RAID-1: > > HEAD: > 2m24.822s/0m18.776s/0m23.156s > 3m34.586s/0m18.784s/0m24.324s > 3m41.557s/0m18.640s/0m23.920s > > Slice-by-8: > 2m26.977s/0m18.420s/0m22.884s > 3m36.664s/0m18.376s/0m24.232s > 3m43.930s/0m18.580s/0m24.560s > > I don't know how to interpret these results (especially the tendency for > the tests to slow down as time passes, with every version). 
The user/sys times are very stable, but the real time is much higher and increases. Does that mean that it's blocked on I/O? - Heikki
On 2014-11-21 13:01:50 +0200, Heikki Linnakangas wrote: > On 11/21/2014 12:11 PM, Abhijit Menon-Sen wrote: > >At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote: > >> > >>>Suggestions for how to address (b) are welcome. > > > >With help from Andres, I set up a workload where XLogInsert* was at the > >top of my profiles: server with fsync and synchronous_commit off, and > >pgbench running a multiple-row insert into a single-text-column table > >with -M prepared -c 25 -t 250000 -f script. > > How wide is the row, in terms of bytes? You should see bigger improvement > with wider rows, as you get longer contiguous chunks of data to CRC that > way. With very narrow rows, you might not see much difference because the > chunks are so small. The primary goal, as I understood it, was to test very short records to test whether there's a regression due to slice-by-8. There doesn't seem to be any. It's, IIRC, trivial to reproduce significant performance benefits in workloads with lots of FPWs or large records (like COPY), but those weren't what you and Tom were concerned about... > If that's the problem, it might be beneficial to memcpy() all the data to a > temporary buffer, and calculate the CRC over the whole, instead of CRC'ing > each XLogRecData chunk separately. XLogRecordAssemble() uses a scratch area, > hdr_scratch, for building all the headers. You could check how much > rmgr-specific data there is, and if there isn't much, just append the data > to that scratch area too. I think that might very well make sense - but it's imo something better done separately from the smarter CRC algorithm. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 11/21/2014 01:06 PM, Andres Freund wrote: > On 2014-11-21 13:01:50 +0200, Heikki Linnakangas wrote: >> On 11/21/2014 12:11 PM, Abhijit Menon-Sen wrote: >>> At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote: >>>> >>>>> Suggestions for how to address (b) are welcome. >>> >>> With help from Andres, I set up a workload where XLogInsert* was at the >>> top of my profiles: server with fsync and synchronous_commit off, and >>> pgbench running a multiple-row insert into a single-text-column table >>> with -M prepared -c 25 -t 250000 -f script. >> >> How wide is the row, in terms of bytes? You should see bigger improvement >> with wider rows, as you get longer contiguous chunks of data to CRC that >> way. With very narrow rows, you might not see much difference because the >> chunks are so small. > > The primary goal, as I understood it, was to test very short records to > test whether there's a regression due to slice-by-8. There doesn't seem > to be any. Ah, OK. Mission accomplished, then. > It's, IIRC, trivial to reproduce significant performance benefits in > workloads with lots of FPWs or large records (like COPY), but those > weren't what you and Tom were concerned about... Yeah. >> If that's the problem, it might be beneficial to memcpy() all the data to a >> temporary buffer, and calculate the CRC over the whole, instead of CRC'ing >> each XLogRecData chunk separately. XLogRecordAssemble() uses a scratch area, >> hdr_scratch, for building all the headers. You could check how much >> rmgr-specific data there is, and if there isn't much, just append the data >> to that scratch area too. > > I think that might very well make sense - but it's imo something better > done separately from the smarter CRC algorithm. Agreed. - Heikki
On Fri, Nov 21, 2014 at 12:11 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: > If anyone has other suggestions, I'm all ears. Do you have a WIP patch I could take a look at and tweak? Maybe there's something about the compiler's code generation that could be improved. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On Fri, Nov 21, 2014 at 03:41:45PM +0530, Abhijit Menon-Sen wrote: > At 2014-11-20 13:47:00 +0530, ams@2ndQuadrant.com wrote: > > > > > Suggestions for how to address (b) are welcome. > > With help from Andres, I set up a workload where XLogInsert* was at the > top of my profiles: server with fsync and synchronous_commit off, and > pgbench running a multiple-row insert into a single-text-column table > with -M prepared -c 25 -t 250000 -f script. > > Unfortunately I can't see much difference despite running things with > slightly different parameters a few dozen times. For example, here are > real/user/sys times from three runs each with HEAD and slice-by-8 on an > otherwise-idle i7-3770 server with a couple of undistinguished Toshiba > 7200rpm SATA disks in RAID-1: > > HEAD: > 2m24.822s/0m18.776s/0m23.156s > 3m34.586s/0m18.784s/0m24.324s > 3m41.557s/0m18.640s/0m23.920s > > Slice-by-8: > 2m26.977s/0m18.420s/0m22.884s > 3m36.664s/0m18.376s/0m24.232s > 3m43.930s/0m18.580s/0m24.560s > > I don't know how to interpret these results (especially the tendency for > the tests to slow down as time passes, with every version). At best, it > shows that the new CRC code doesn't hurt, at worst it's just irrelevant. > Supporting the latter interpretation, using the hardware-CRC patch also > gives similar results (but XLogInsert is not at the top of the profile). > > Hardware-CRC: > 2m29.090s/0m18.652s/0m23.764s > 3m30.188s/0m18.692s/0m25.332s > 3m38.110s/0m20.424s/0m24.532s > > If anyone has other suggestions, I'm all ears. > > -- Abhijit Hi, This indicates that another part of the system is the resource limit, not the CRC calculation. My money is on the I/O system. Try it using an in memory filesystem and see if that matters. Even if it is still the same results, at least the new version of CRC is no worse than the current version. Regards, Ken
On Fri, Nov 21, 2014 at 07:49:45AM -0600, ktm@rice.edu wrote: > Hi, > > This indicates that another part of the system is the resource limit, > not the CRC calculation. My money is on the I/O system. Try it using > an in memory filesystem and see if that matters. Even if it is still > the same results, at least the new version of CRC is no worse than > the current version. I would test it with fsync=off to remove the fsync delay. That will simulate an SSD or BBU controller. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
At 2014-11-26 18:56:52 -0500, bruce@momjian.us wrote: > > I would test it with fsync=off to remove the fsync delay. That will > simulate an SSD or BBU controller. The earlier tests were run with fsync=off. I ran another round of the replay tests on the i7 server, this time replaying 30GB of WAL segments (start, end, elapsed):

master
16:08:03 16:16:23 08:20
16:19:33 16:28:03 08:30
16:32:20 16:41:17 08:57
16:42:42 16:51:22 08:40

8crc
16:52:11 16:59:48 07:37
17:01:07 17:08:30 07:23
17:22:16 17:30:11 07:55
17:31:29 17:39:06 07:37

These are just the results of the last four runs with each branch, but the difference was consistent over many more runs. To summarise: the slice-by-8 patch (a) increases pure CRC performance by a large margin, (b) has a positive effect on reading WAL during replay, (c) doesn't hurt XLogInsert in typical cases (which was the main worry). -- Abhijit
On Thu, Nov 27, 2014 at 11:19:51AM +0530, Abhijit Menon-Sen wrote: > At 2014-11-26 18:56:52 -0500, bruce@momjian.us wrote: > > > > I would test it with fsync=off to remove the fsync delay. That will > > simulate an SSD or BBU controller. > > The earlier tests were run with fsync=off. > > I ran another round of the replay tests on the i7 server, this time > replaying 30GB of WAL segments. > > master > > 16:08:03 16:16:23 08:20 > 16:19:33 16:28:03 08:30 > 16:32:20 16:41:17 08:57 > 16:42:42 16:51:22 08:40 > > 8crc > > 16:52:11 16:59:48 07:37 > 17:01:07 17:08:30 07:23 > 17:22:16 17:30:11 07:55 > 17:31:29 17:39:06 07:37 > > These are just the results of the last four runs with each branch, but > the difference was consistent over many more runs. > > To summarise: the slice-by-8 patch (a) increases pure CRC performance by > a large margin, (b) has a positive effect on reading WAL during replay, > (c) doesn't hurt XLogInsert in typical cases (which was the main worry). Thanks, these are good numbers. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Hi. Here's a proposed patch to use CPUID at startup to determine whether the SSE4.2 CRC instructions are available, and to use them instead of the slice-by-8 implementation (posted earlier). A few notes:

1. GCC has included cpuid.h since 4.3.0, so I figured it was safe to use. It can be replaced with some inline assembly otherwise.
2. I've also used the crc32b/crc32q instructions directly rather than using ".bytes" to encode the instructions; binutils versions since 2007 or so have supported them.
3. I've included the MSVC implementation mostly as an example of how to extend this to different compilers/platforms. It's written according to the documentation for MSVC intrinsics, but I have not tested it.

Suggestions/improvements are welcome. -- Abhijit
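For readers following along, the runtime-detection idea can be sketched as follows. This is an illustrative sketch, not the patch itself: the function name pg_crc32c_available() is invented here, and only the GCC/x86 path is shown; the actual patch also covers MSVC via intrinsics.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>				/* shipped with GCC since 4.3.0 */
#endif

/*
 * Return 1 if the CPU we are running on advertises SSE 4.2 (and hence
 * the crc32b/crc32q instructions), 0 otherwise.  On other compilers or
 * architectures this conservatively reports 0, so the caller would fall
 * back to the slice-by-8 implementation.
 */
int
pg_crc32c_available(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 1 returns feature flags; SSE 4.2 is bit 20 of ECX. */
	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return (ecx & bit_SSE4_2) != 0;
#endif
	return 0;
}
```

The patch performs this check once at startup and points pg_comp_crc32c at the chosen implementation, so the per-call overhead is just one indirect function call.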
Attachment
On Thu, Dec 25, 2014 at 11:57:29AM +0530, Abhijit Menon-Sen wrote: > Hi. > > Here's a proposed patch to use CPUID at startup to determine if the > SSE4.2 CRC instructions are available, to use them instead of the > slice-by-8 implementation (posted earlier). > > A few notes: > > 1. GCC has included cpuid.h since 4.3.0, so I figured it was safe to > use. It can be replaced with some inline assembly otherwise. > > 2. I've also used the crc32b/crc32q instructions directly rather than > using ".bytes" to encode the instructions; binutils versions since > 2007 or so have supported them. > > 3. I've included the MSVC implementation mostly as an example of how to > extend this to different compilers/platforms. It's written according > to the documentation for MSVC intrinsics, but I have not tested it. > Suggestions/improvements are welcome. Uh, what happens if the system is compiled on a different CPU than it is run on? Seems we would need a run-time CPU test. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On December 26, 2014 4:50:33 PM CET, Bruce Momjian <bruce@momjian.us> wrote: >On Thu, Dec 25, 2014 at 11:57:29AM +0530, Abhijit Menon-Sen wrote: >> Hi. >> >> Here's a proposed patch to use CPUID at startup to determine if the >> SSE4.2 CRC instructions are available, to use them instead of the >> slice-by-8 implementation (posted earlier). >> >> A few notes: >> >> 1. GCC has included cpuid.h since 4.3.0, so I figured it was safe to >> use. It can be replaced with some inline assembly otherwise. >> >> 2. I've also used the crc32b/crc32q instructions directly rather than >> using ".bytes" to encode the instructions; bintuils versions since >> 2007 or so have supported them. >> >> 3. I've included the MSVC implementation mostly as an example of how >to >> extend this to different compilers/platforms. It's written >according >> to the documentation for MSVC intrinsics, but I have not tested >it. >> Suggestions/improvements are welcome. > >Uh, what happens if the system is compiled on a different CPU that it >is >run on? Seems we would need a run-time CPU test. That's the cpuid thing mentioned above. Andres -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 26, 2014 at 04:52:58PM +0100, Andres Freund wrote: > On December 26, 2014 4:50:33 PM CET, Bruce Momjian <bruce@momjian.us> wrote: > >On Thu, Dec 25, 2014 at 11:57:29AM +0530, Abhijit Menon-Sen wrote: > >> Hi. > >> > >> Here's a proposed patch to use CPUID at startup to determine if the > >> SSE4.2 CRC instructions are available, to use them instead of the > >> slice-by-8 implementation (posted earlier). > >> > >> A few notes: > >> > >> 1. GCC has included cpuid.h since 4.3.0, so I figured it was safe to > >> use. It can be replaced with some inline assembly otherwise. > >> > >> 2. I've also used the crc32b/crc32q instructions directly rather than > >> using ".bytes" to encode the instructions; bintuils versions since > >> 2007 or so have supported them. > >> > >> 3. I've included the MSVC implementation mostly as an example of how > >to > >> extend this to different compilers/platforms. It's written > >according > >> to the documentation for MSVC intrinsics, but I have not tested > >it. > >> Suggestions/improvements are welcome. > > > >Uh, what happens if the system is compiled on a different CPU that it > >is > >run on? Seems we would need a run-time CPU test. > > That's the cpuid thing mentioned above. Oh, so cpuid is not a macro exported by the compiler but a C API. Good. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On December 26, 2014 6:05:34 PM CET, Bruce Momjian <bruce@momjian.us> wrote: >On Fri, Dec 26, 2014 at 04:52:58PM +0100, Andres Freund wrote: >> On December 26, 2014 4:50:33 PM CET, Bruce Momjian <bruce@momjian.us> >wrote: >> >On Thu, Dec 25, 2014 at 11:57:29AM +0530, Abhijit Menon-Sen wrote: >> >> Hi. >> >> >> >> Here's a proposed patch to use CPUID at startup to determine if >the >> >> SSE4.2 CRC instructions are available, to use them instead of the >> >> slice-by-8 implementation (posted earlier). >> >> >> >> A few notes: >> >> >> >> 1. GCC has included cpuid.h since 4.3.0, so I figured it was safe >to >> >> use. It can be replaced with some inline assembly otherwise. >> >> >> >> 2. I've also used the crc32b/crc32q instructions directly rather >than >> >> using ".bytes" to encode the instructions; bintuils versions >since >> >> 2007 or so have supported them. >> >> >> >> 3. I've included the MSVC implementation mostly as an example of >how >> >to >> >> extend this to different compilers/platforms. It's written >> >according >> >> to the documentation for MSVC intrinsics, but I have not tested >> >it. >> >> Suggestions/improvements are welcome. >> > >> >Uh, what happens if the system is compiled on a different CPU that >it >> >is >> >run on? Seems we would need a run-time CPU test. >> >> That's the cpuid thing mentioned above. > >Oh, so cpuid is not a macro exported by the compiler but a C API. >Good Neither, really. It's a instruction returning CPU capabilities. -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 26, 2014 at 04:52:58PM +0100, Andres Freund wrote: > > Uh, what happens if the system is compiled on a different CPU than it is run on? Seems we would need a run-time CPU test. > > That's the cpuid thing mentioned above. Is this something that could potentially change the data stored on disk? Does pg_upgrade need to check for changes in this area? Is the detection exposed by pg_controldata? Could this affect running the data directory on a different CPU? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
At 2014-12-26 13:11:43 -0500, bruce@momjian.us wrote: > > Is this something that could potentially change the data stored on > disk? Does pg_upgrade need to check for changes in this area? Is the > detection exposed by pg_controldata? Could this affect running the > data directory on a different CPU? No to all. Subsequent to Heikki's change (already in master) to use the CRC-32C algorithm (instead of the earlier mistakenly-reflected-but-not-quite one), both the slice-by-8 software implementation posted earlier and the SSE4.2 CRC32* instructions will compute exactly the same value. (Yes, I have verified this in both cases.) -- Abhijit
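As a reference point for "exactly the same value": CRC-32C can be computed bit-by-bit from the reversed polynomial 0x82F63B78 (the reflected form of 0x1EDC6F41). The sketch below is for verification only; it is not the Postgres code, and the name crc32c_bitwise is invented here. Any correct implementation, slice-by-8 or SSE4.2, must agree with it (the standard check value for the ASCII string "123456789" is 0xE3069283).

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time CRC-32C (Castagnoli), reflected form, for cross-checking. */
uint32_t
crc32c_bitwise(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFF;	/* as in INIT_CRC32C */

	while (len-- > 0)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
	}
	return crc ^ 0xFFFFFFFF;		/* as in FIN_CRC32C */
}
```

Note that the bit-order pitfalls described at the start of this thread (mixing reflected tables with non-reflected lookup code) are exactly what such a reference implementation guards against.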
On Fri, Dec 26, 2014 at 11:52:41PM +0530, Abhijit Menon-Sen wrote: > At 2014-12-26 13:11:43 -0500, bruce@momjian.us wrote: > > > > Is this something that could potentially change the data stored on > > disk? Does pg_upgrade need to check for changes in this area? Is the > > detection exposed by pg_controldata? Could this affect running the > > data directory on a different CPU? > > No to all. > > Subsequent to Heikki's change (already in master) to use the CRC-32C > algorithm (instead of the earlier mistakenly-reflected-but-not-quite > one), both the slice-by-8 software implementation posted earlier and > the SSE4.2 CRC32* instructions will compute exactly the same value. > > (Yes, I have verified this in both cases.) Uh, we didn't fully change all code to use the new algorithm because there were cases where the CRC was already stored on disk, e.g. hstore, ltree. I assume you are only linking into Heikki's new code and will not change the places that use the old CRC method on disk --- just checking. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
At 2014-12-26 13:48:40 -0500, bruce@momjian.us wrote: > > I assume you are only linking into Heikki's new code and will not > change the places that use the old CRC method on disk --- just > checking. Correct. The legacy CRC computation code used by ltree etc. is completely unaffected by both my sets of changes. -- Abhijit
Hi, On 2014-12-25 11:57:29 +0530, Abhijit Menon-Sen wrote: > -extern pg_crc32 pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len); > +extern void pg_init_comp_crc32c(void); How about pg_choose_crc_impl() or something? > +extern pg_crc32 (*pg_comp_crc32c)(pg_crc32 crc, const void *data, size_t len); > > /* > * CRC calculation using the CRC-32C (Castagnoli) polynomial. > > diff --git a/src/port/pg_crc.c b/src/port/pg_crc.c > index 2f9857b..6be17b0 100644 > --- a/src/port/pg_crc.c > +++ b/src/port/pg_crc.c > @@ -21,6 +21,13 @@ > #include "utils/pg_crc.h" > #include "utils/pg_crc_tables.h" > > +#if defined(HAVE_CPUID_H) > +#include <cpuid.h> > +#elif defined(_MSC_VER) > +#include <intrin.h> > +#include <nmmintrin.h> > +#endif > + > static inline uint32 bswap32(uint32 x) > { > #if defined(__GNUC__) || defined(__clang__) > @@ -39,8 +46,8 @@ static inline uint32 bswap32(uint32 x) > #define cpu_to_le32(x) x > #endif > > -pg_crc32 > -pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len) > +static pg_crc32 > +pg_comp_crc32c_sb8(pg_crc32 crc, const void *data, size_t len) _sb8? Unless I miss something it's not slice by 8 but rather bytewise? > { > const unsigned char *p = data; > const uint32 *p8; > @@ -61,7 +68,6 @@ pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len) > */ > > p8 = (const uint32 *) p; > - > while (len >= 8) > { > uint32 a = *p8++ ^ cpu_to_le32(crc); > @@ -101,8 +107,102 @@ pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len) > */ > > p = (const unsigned char *) p8; > - while (len-- > 0) > + while (len > 0) > + { > crc = pg_crc32c_table[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8); > + len--; > + } > + > + return crc; > +} > > +static pg_crc32 > +pg_asm_crc32b(pg_crc32 crc, unsigned char data) > +{ Should be marked inline. > +#ifdef __GNUC__ > + __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data)); Have you checked which version of gcc introduced named references to input/output parameters? 
> return crc; > +#elif defined(_MSC_VER) > + return _mm_crc32_u8(crc, data); > +#else > +#error "Don't know how to generate crc32b instruction" > +#endif > } > + > +static pg_crc32 > +pg_asm_crc32q(uint64 crc, unsigned long long data) > +{ inline. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
At 2014-12-29 13:22:28 +0100, andres@2ndquadrant.com wrote: > > How about pg_choose_crc_impl() or something? Done. > _sb8? Unless I miss something it's not slice by 8 but rather bytewise? This is meant to apply on top of the earlier patches I posted to implement slice-by-8. I'll attach both here. > Should be marked inline. Done (and the other one too). > > +#ifdef __GNUC__ > > + __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data)); > > Have you checked which version of gcc introduced named references to > input/output parameters? No. The documentation calls them "asmSymbolicName"s, but I can't find the term (or various likely alternatives, e.g. symbolic_operand may be related) either in the release notes, the changelog, or the source (on Github). Does anyone know, or know how to find out? Meanwhile, I have attached the two patches with the other modifications. -- Abhijit
Attachment
At 2014-12-29 18:44:18 +0530, ams@2ndQuadrant.com wrote: > > > > +#ifdef __GNUC__ > > > + __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data)); > > > > Have you checked which version of gcc introduced named references to > > input/output parameters? OK, here we go: «As of GCC version 3.1, it is also possible to specify input and output operands using symbolic names which can be referenced within the assembler code.» GCC 3.1 was released on May 15, 2002. So it should be quite safe to use this feature. -- Abhijit
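To illustrate the symbolic-name syntax under discussion, here is a hedged sketch of a single-byte CRC-32C step. The asm path mirrors the patch's pg_asm_crc32b(); the fallback branch is an assumption added here (not part of the patch) so the sketch compiles on non-x86 targets. The asm path also assumes an SSE-4.2-capable CPU, since no cpuid check is shown.

```c
#include <stdint.h>

/* One CRC-32C step over a single byte. */
uint32_t
crc32c_byte(uint32_t crc, unsigned char data)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	/* Symbolic operand names ([crc], [data]) are available since GCC 3.1. */
	__asm__("crc32b %[data], %[crc]" : [crc] "+r" (crc) : [data] "rm" (data));
	return crc;
#else
	/* Bitwise equivalent for other compilers/architectures. */
	crc ^= data;
	for (int bit = 0; bit < 8; bit++)
		crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
	return crc;
#endif
}
```

Either path performs the same state transition, which is what lets the SSE4.2 instruction substitute directly for one table-lookup step of COMP_CRC32C.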
Hi. I'm re-attaching the two patches as produced by format-patch. I haven't listed any reviewers. It's either just Andres, or maybe a lot of people. Is anyone in a position to try out the patches on MSVC and see if they build and work sensibly, please? (Otherwise it may be better to remove those bits from the patch for now.) Thanks. -- Abhijit
Attachment
On 12/30/2014 09:40 AM, Abhijit Menon-Sen wrote: > Hi. > > I'm re-attaching the two patches as produced by format-patch. I haven't > listed any reviewers. It's either just Andres, or maybe a lot of people. > > Is anyone in a position to try out the patches on MSVC and see if they > build and work sensibly, please? (Otherwise it may be better to remove > those bits from the patch for now.) A couple of quick comments: bswap32 is unused on little-endian systems. That will give a compiler warning. pg_comp_crc32c_sse processes one byte at a time, until the pointer is 4-byte aligned. Then it processes 8 bytes at a time. So it fetches the 8-byte chunks from only 4-byte aligned addresses. Is that intentional? If unaligned access performs well, why bother with the initial byte-at-a-time processing at all? - Heikki
At 2014-12-30 16:05:50 +0200, hlinnakangas@vmware.com wrote: > > A couple of quick comments: > > bswap32 is unused on little-endian systems. That will give a > compiler warning. Huh. I don't get a warning, even when I add -Wunused to the build flags. But since you mention it, it would be better to write the function thus:

static inline uint32
cpu_to_le32(uint32 x)
{
#ifndef WORDS_BIGENDIAN
	return x;
#elif defined(__GNUC__) || defined(__clang__)
	return __builtin_bswap32(x);
#else
	return ((x << 24) & 0xff000000) |
	       ((x << 8) & 0x00ff0000) |
	       ((x >> 8) & 0x0000ff00) |
	       ((x >> 24) & 0x000000ff);
#endif
}

> pg_comp_crc32c_sse […] fetches the 8-byte chunks from only 4-byte > aligned addresses. Is that intentional? Thanks for spotting that. I had meant to change the test to "& 7". But again, now that you mention it, I'm not sure it's necessary. The CRC32* instructions don't have the usual SSE alignment requirements, and I see that the Linux kernel (among other implementations) processes eight bytes at a time from the start of the buffer and then processes the remaining bytes one at a time. I'll do a bit more research and post an update. Thanks for having a look. -- Abhijit
On 2014-12-30 21:36:19 +0530, Abhijit Menon-Sen wrote: > Thanks for spotting that. I had meant to change the test to "& 7". But > again, now that you mention it, I'm not sure it's necessary. The CRC32* > instructions don't have the usual SSE alignment requirements, and I see > that the Linux kernel (among other implementations) process eight bytes > at a time from the start of the buffer and then process the remaining > bytes one at a time. I'll do a bit more research and post an update. I'd done some quick and dirty benchmarking when writing my initial crc32c POC and I found that four byte aligned accesses were faster than ones not. But it really just was running it a couple times, nothing more. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi. OK, here are the patches with the various suggestions applied. I found that the alignment didn't seem to make much difference for the CRC32* instructions, so I changed it to process len/8 eight-byte chunks followed by len%8 single bytes, the way the Linux kernel does. -- Abhijit
Attachment
On 01/01/2015 09:17 AM, Abhijit Menon-Sen wrote: > Hi. > > OK, here are the patches with the various suggestions applied. > > I found that the alignment didn't seem to make much difference for the > CRC32* instructions, so I changed to process (len/8)*8bytes followed by > (len%8)*1bytes, the way the Linux kernel does. Ok. In the slicing-by-8 version, I wonder if it would be better to do single-byte loads to c0-c7, instead of two 4-byte loads and shifts. 4-byte loads are presumably faster than single byte loads, but then you'd avoid the shifts. And then you could go straight into the 8-bytes-at-a-time loop, without the initial single-byte processing to get the start address aligned. (the Linux implementation doesn't do that, so maybe it's a bad idea, but might be worth testing..) Looking at the Linux implementation, I think it only does the bswap once per call, not inside the hot loop. Would it even make sense to keep the crc variable in different byte order, and only do the byte-swap once in END_CRC32() ? The comments need some work. I note that there is no mention of the slicing-by-8 algorithm anywhere in the comments (in the first patch). Instead of checking for "defined(__GNUC__) || defined(__clang__)", should add an explicit configure test for __builtin_bswap32(). - Heikki
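For readers unfamiliar with the technique, slicing-by-8 can be sketched compactly in portable C. This is an illustration, not the patch: the tables are derived at startup here instead of being stored as constants (as in pg_crc_tables.h), the names sb8_init/sb8_crc32c are invented, and the main loop is written for little-endian byte order only. Handling big-endian is exactly the bswap/table-order question discussed above.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

uint32_t	sb8_table[8][256];

/*
 * Build the eight lookup tables from the classic one-byte table.
 * sb8_table[0] is the ordinary byte-at-a-time table for CRC-32C;
 * sb8_table[k][i] advances byte value i through k further zero bytes.
 */
void
sb8_init(void)
{
	for (int i = 0; i < 256; i++)
	{
		uint32_t	crc = i;

		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
		sb8_table[0][i] = crc;
	}
	for (int k = 1; k < 8; k++)
		for (int i = 0; i < 256; i++)
			sb8_table[k][i] = (sb8_table[k - 1][i] >> 8) ^
				sb8_table[0][sb8_table[k - 1][i] & 0xFF];
}

/* CRC-32C, eight bytes per iteration; assumes a little-endian CPU. */
uint32_t
sb8_crc32c(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFF;

	while (len >= 8)
	{
		uint32_t	a, b;

		memcpy(&a, p, 4);		/* memcpy makes unaligned loads safe */
		memcpy(&b, p + 4, 4);
		a ^= crc;
		crc = sb8_table[7][a & 0xFF] ^ sb8_table[6][(a >> 8) & 0xFF] ^
			sb8_table[5][(a >> 16) & 0xFF] ^ sb8_table[4][a >> 24] ^
			sb8_table[3][b & 0xFF] ^ sb8_table[2][(b >> 8) & 0xFF] ^
			sb8_table[1][(b >> 16) & 0xFF] ^ sb8_table[0][b >> 24];
		p += 8;
		len -= 8;
	}
	while (len-- > 0)
		crc = (crc >> 8) ^ sb8_table[0][(crc ^ *p++) & 0xFF];
	return crc ^ 0xFFFFFFFF;
}
```

A byte sitting d positions before the end of the 8-byte group is looked up in sb8_table[d], which is why the first word indexes tables 7..4 and the second word tables 3..0.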
At 2015-01-02 16:46:29 +0200, hlinnakangas@vmware.com wrote: > > In the slicing-by-8 version, I wonder if it would be better to do > single-byte loads to c0-c7, instead of two 4-byte loads and shifts. Nope. I did some tests, and the sb8 code is slightly slower if I remove the 0-7byte alignment loop, and significantly slower if I switch to one byte loads for the whole thing. So I think we should leave that part as it is, but: > Would it even make sense to keep the crc variable in different byte > order, and only do the byte-swap once in END_CRC32() ? …this certainly does make a noticeable difference. Will investigate. > The comments need some work. I note that there is no mention of the > slicing-by-8 algorithm anywhere in the comments (in the first patch). Will fix. (Unfortunately the widely cited original Intel paper about slice-by-8 seems to have gone AWOL, but I'll find something.) > Instead of checking for "defined(__GNUC__) || defined(__clang__)", > should add an explicit configure test for __builtin_bswap32(). Will do. Thanks again. -- Abhijit
Hi Heikki. I've attached two regenerated CRC patches, split up as before. 1. The slicing-by-8 patch contains numerous changes: a. A configure test for __builtin_bswap32 b. A comment referencing the slicing-by-8 paper (which is behind a paywall, unfortunately, so I haven't even read it). Are more comments needed? If so, where/what kind? c. A byte-reversed big-endian version of the 8*256 table. In Linux, there's only one table that uses __constant_swab32, but for us it's simpler to have two tables. d. Thanks to (c), we can avoid the bswap32 in the hot loop. e. On big-endian systems, FIN_CRC32C now bswap32()s the CRC before finalising it. (We don't need to do this in INIT_CRC32C only because the initialiser is 0xFFFFFFFF.) 2. The sse4.2 patch has only some minor compile fixes. I have built and tested both patches individually on little-endian (amd64) and big-endian (ppc) systems. I verified that the _sse is chosen at startup on the former, and _sb8 on the latter, and that both implementations function correctly with respect to HEAD. Please let me know if there's anything else I need to do. -- Abhijit
Attachment
On 01/09/2015 10:32 AM, Abhijit Menon-Sen wrote: > 1. The slicing-by-8 patch contains numerous changes: With this patch, CALC_CRC32C is no longer a pure macro, but includes a function call. That means that you can no longer just #include pg_crc.h and pg_crc_tables.h in an external program. We made that arrangement with two header files in 2012 [1], and added src/port/pg_crc.c which just #includes pg_crc_tables.h, so that the backend and frontend programs that use libpgport will have just one copy of the tables. Now that there's some actual code in pg_crc.c, I think we have to just give up on being able to get the CRC implementation without libpgport. It was a nice thought, but I doubt there are any programs out there that would have a problem with that. Anything that wants to read the WAL needs xlogreader.c anyway. But actually, we should now move pg_crc.c to src/common. It was a bit of a hack to have it in src/port in the first place, because it has nothing to do with portability, but src/common didn't exist back then. Now it does. So I propose to move pg_crc.c to src/common, and move the tables from pg_crc_tables.h directly to pg_crc.c. Thoughts? [1] http://www.postgresql.org/message-id/flat/CAAZKuFaNcf3=YtadkWwr8yHb+1axW2RepmQ2j8a9NNGkV7PN=w@mail.gmail.com. - Heikki
On 2015-02-08 18:46:30 +0200, Heikki Linnakangas wrote: > So I propose to move pg_crc.c to src/common, and move the tables from > pg_crc_tables.h directly to pg_crc.c. Thoughts? +1. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
At 2015-02-08 18:46:30 +0200, hlinnakangas@vmware.com wrote: > > So I propose to move pg_crc.c to src/common, and move the tables > from pg_crc_tables.h directly to pg_crc.c. Thoughts? Sounds fine to me. -- Abhijit
On 02/08/2015 08:33 PM, Abhijit Menon-Sen wrote: > At 2015-02-08 18:46:30 +0200, hlinnakangas@vmware.com wrote: >> >> So I propose to move pg_crc.c to src/common, and move the tables >> from pg_crc_tables.h directly to pg_crc.c. Thoughts? > > Sounds fine to me. Ok, I've committed a patch that just moves the existing code to common/pg_crc.c. I also moved pg_crc.h from include/utils to include/common. That'll need any external programs to change their #include accordingly, but I think it was worth that. include/common is clearly the correct home for that file, and the only reason to keep it in include/utils would've been for backwards-compatibility. Attached is a rebased version of the slicing-by-8 patch. I've made some cosmetic changes. Most notably, I turned the bswap32() function into a macro. Better to avoid the overhead of a function call, and it also avoids building the function altogether on little-endian systems that don't need it. I'll continue reviewing this. > At 2015-01-02 16:46:29 +0200, hlinnakangas@vmware.com wrote: >> >> Would it even make sense to keep the crc variable in different byte >> order, and only do the byte-swap once in END_CRC32() ? > > ...this certainly does make a noticeable difference. Will investigate. Do you have access to big-endian hardware to test this on? It seems like an obvious optimization to shave off that one instruction from the hot loop, but if it turns out not to make any measurable difference, I'd prefer to keep the tables in the same order on big- and little-endian systems, reducing the amount of #ifdefs needed. I tested this on my laptop by adding a BSWAP32() into the hot loop - which is bogus on a little-endian Intel system - and it seems to make about 5% difference in a quick micro-benchmark. But it would be nice to get some numbers from the kind of big-endian systems that people run in the real world. - Heikki
Attachment
At 2015-02-09 12:52:41 +0200, hlinnakangas@vmware.com wrote: > > Ok, I've committed a patch that just moves the existing code to > common/pg_crc.c […] Thanks, looks good. > Attached is a rebased version of the slicing-by-8 patch. Looks OK too. > Do you have access to big-endian hardware to test this on? Yes, I tested it on a Linux/ppc system. I wasn't speculating when I said it "does make a noticeable difference", though I'm afraid I did not keep the timings after submitting the revised patch. The speedup was some double-digit percentage, IIRC. -- Abhijit
On 02/09/2015 03:20 PM, Abhijit Menon-Sen wrote: > At 2015-02-09 12:52:41 +0200, hlinnakangas@vmware.com wrote: >> Do you have access to big-endian hardware to test this on? > > Yes, I tested it on a Linux/ppc system. I wasn't speculating when I said > it "does make a noticeable difference", though I'm afraid I did not keep > the timings after submitting the revised patch. The speedup was some > double-digit percentage, IIRC. Ok, that's good enough for me. Committed with a few more edits on comments and such. I'll start looking at the second patch now. - Heikki
On 01/09/2015 10:32 AM, Abhijit Menon-Sen wrote: > 2. The sse4.2 patch has only some minor compile fixes. > > I have built and tested both patches individually on little-endian > (amd64) and big-endian (ppc) systems. I verified that the _sse is > chosen at startup on the former, and _sb8 on the latter, and that > both implementations function correctly with respect to HEAD. I'd like to arrange this somewhat differently, to keep the really low-level assembler blocks and such separate from the higher-level parts. Especially now that pg_crc.c is in src/common and src/port, it doesn't seem appropriate to have assembler code in there. Also, some of the high-level code can be reused if we later add support e.g. for the ARMv8 CRC instructions. I propose that we add a new header file in src/port, let's call it crc_instructions.h. That file contains the very low-level stuff, the pg_asm_crc32q() and pg_asm_crc32b() inline functions, which just contain the single assembler instruction. Or the corresponding mnemonic or macro depending on the compiler; the point of the crc_instructions.h header is to hide the differences between compilers and architectures. If the CRC instructions are available, crc_instructions.h defines PG_HAVE_CRC32C_INSTRUCTIONS, as well as the pg_asm_crc32q() and pg_asm_crc32b() macros/functions. It also defines pg_crc32_instructions_runtime_check(), to perform the runtime check to determine whether the instructions can be used on the current platform (i.e. if you're running on a CPU with SSE 4.2 support). There's another symbol PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK, which indicates whether the runtime check is needed. That's currently always defined when PG_HAVE_CRC32C_INSTRUCTIONS is, but conceivably you might want to build a version that skips the runtime check altogether, and doesn't therefore require the slicing-by-8 fallback implementation at all.
Gentoo users might like that ;-), as well as possible future architectures that always have CRC32 instructions. Attached is a patch along those lines. I haven't done much testing, but it demonstrates the code structure I have in mind. A couple of remarks on your original patch, which also apply to the attached version: > +static inline pg_crc32 > +pg_asm_crc32b(pg_crc32 crc, unsigned char data) > +{ > +#if defined(__GNUC__) && defined(__x86_64__) > + __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data)); > + return crc; > +#elif defined(_MSC_VER) > + return _mm_crc32_u8(crc, data); > +#else > + /* Can't generate crc32b, but keep the compiler quiet. */ > + return 0; > +#endif I believe the CRC instructions are also available in 32-bit mode, so the check for __x86_64__ is not needed. Right? Any CPU recent enough to have these instructions certainly supports the x86-64 instruction set, but you might nevertheless have compiled and be running in i386 mode. It would be nice to use GCC's builtin intrinsics, __builtin_ia32_crc32* where available, but I guess those are only available if you build with -msse4.2, and that would defeat the point of the runtime check. I believe the _mm_* mnemonics would also work with the Intel compiler. - Heikki
Attachment
At 2015-02-10 14:30:51 +0200, hlinnakangas@vmware.com wrote:
>
> I propose that we add a new header file in src/port, let's call it
> crc_instructions.h […] the point of the crc_instructions.h header
> is to hide the differences between compilers and architectures.

Moving the assembly code/compiler intrinsics to a separate header is fine with me, but I really don't like the proposed implementation at all. Here are some comments:

1. I don't mind moving platform-specific tests for CRC32C instructions to configure, but if we only define PG_HAVE_CRC32C_INSTRUCTIONS, we would anyway have to reproduce all that platform-specific stuff in the header file. To do it properly, I think we should generate the right version of crc_instructions.h for the platform. Otherwise, what's the point? Might as well have only the complicated header.

2. At this point, I think we should stick with the _sse function rather than a generic _hw one to "drive" the platform-specific instructions. The structure of that function (n/8*(8 bytes) + n%8*(bytes)) is specific to what works best for the SSE instructions. On another platform, we may need something very similar, or very different. With the proposed structure, _hw would inevitably become #ifdef'ed non-trivially, e.g. to run a single-byte loop at the start to align the main loop, but only for certain combinations of #defines. I think we should consider what makes the most sense when someone actually wants to add support for another platform/compiler, and not before.

3. I dislike everything about pg_crc32_instructions_runtime_check(): the name, the separate PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK #define, the fact that you try to do the test "inline" rather than at startup… Maybe you're trying to avoid checking separately at startup so that a program that links with code from src/common doesn't need to do its own test. But I don't think the results are worth it.
As for omitting the slicing-by-8 tables, if there's a more compelling reason to do that than "Gentoo users might like that ;-)", I think it should be done with a straightforward -DUSE_CRC_SSE style arrangement rather than figuring out that you HAVE_CRC32C_INSTRUCTIONS but don't NEED_RUNTIME_CHECK. If you want to build special-purpose binaries, just pick whichever CRC implementation you like, done.

> I believe the CRC instructions are also available in 32-bit mode, so
> the check for __x86_64__ is not needed. Right?

I'm not sure, but I guess you're right.

> It would be nice to use GCC's builtin intrinsics,
> __builtin_ia32_crc32* where available, but I guess those are only
> available if you build with -msse4.2, and that would defeat the
> point of the runtime check.

Exactly.

-- Abhijit
On 02/11/2015 11:28 AM, Abhijit Menon-Sen wrote:
> 1. I don't mind moving platform-specific tests for CRC32C instructions
> to configure, but if we only define PG_HAVE_CRC32C_INSTRUCTIONS, we
> would anyway have to reproduce all that platform-specific stuff in
> the header file. To do it properly, I think we should generate the
> right version of crc_instructions.h for the platform. Otherwise,
> what's the point? Might as well have only the complicated header.

I don't follow. I didn't change configure at all, compared to your patch.

> 2. At this point, I think we should stick with the _sse function rather
> than a generic _hw one to "drive" the platform-specific instructions.
> The structure of that function (n/8*(8 bytes) + n%8*(bytes)) is specific
> to what works best for the SSE instructions. On another platform, we
> may need something very similar, or very different.
>
> With the proposed structure, _hw would inevitably become #ifdef'ed
> non-trivially, e.g. to run a single-byte loop at the start to align
> the main loop, but only for certain combinations of #defines.
>
> I think we should consider what makes the most sense when someone
> actually wants to add support for another platform/compiler, and
> not before.

Hmm, ok. The ARM CRC32C instruction is very similar to the SSE4.2 one, but I presume it does require alignment.

> 3. I dislike everything about pg_crc32_instructions_runtime_check(): the
> name, the separate PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK #define,
> the fact that you try to do the test "inline" rather than at startup…
>
> Maybe you're trying to avoid checking separately at startup so that a
> program that links with code from src/common doesn't need to do its
> own test. But I don't think the results are worth it.

As a case in point, with your patch pg_xlogdump would not have used the CRC instruction, because it never called pg_choose_crc_impl(). "Choose on first use" is much more convenient than requiring every program to call a function.
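[Editor's note: the "choose on first use" idea can be sketched as a function pointer that initially points at a chooser, which replaces itself on the first call. All names here are hypothetical, and a bitwise CRC-32C stands in for the real slicing-by-8 and SSE4.2 implementations.]

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t (*crc32c_fn)(uint32_t crc, const void *data, size_t len);

/* Portable fallback: bitwise CRC-32C (reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_fallback(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
	}
	return crc;
}

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Starts out pointing at the chooser, so no startup call is needed. */
static crc32c_fn pg_comp_crc32c = crc32c_choose;

static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
	/* A real build would test for SSE4.2 here and pick the HW version. */
	pg_comp_crc32c = crc32c_fallback;
	return pg_comp_crc32c(crc, data, len);
}

/* Convenience wrapper mimicking the INIT/COMP/FIN usage pattern. */
static uint32_t
crc32c(const void *data, size_t len)
{
	return pg_comp_crc32c(0xFFFFFFFF, data, len) ^ 0xFFFFFFFF;
}
```

The first caller pays the (tiny) cost of the selection; every later call goes straight to the chosen implementation. This is why pg_xlogdump works without calling any setup function.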
> As for omitting the slicing-by-8 tables, if there's a more compelling
> reason to do that than "Gentoo users might like that ;-)", I think it
> should be done with a straightforward -DUSE_CRC_SSE style arrangement
> rather than figuring out that you HAVE_CRC32C_INSTRUCTIONS but don't
> NEED_RUNTIME_CHECK. If you want to build special-purpose binaries,
> just pick whichever CRC implementation you like, done.

I stopped short of actually allowing you to force the use of SSE, e.g. with -DUSE_CRC_SSE, but my thinking was that that would force PG_HAVE_CRC32C_INSTRUCTIONS to be set, and NEEDS_RUNTIME_CHECK to be unset. I.e. those two #defines control what code is generated, but in the header file.

I think what you're suggesting is that we'd instead have two mutually exclusive #defines, something like "USE_CRC_SSE_ALWAYS" and "USE_CRC_SSE_WITH_RUNTIME_CHECK". It wouldn't be much different, but would require some places to check for both.

>> It would be nice to use GCC's builtin intrinsics,
>> __builtin_ia32_crc32* where available, but I guess those are only
>> available if you build with -msse4.2, and that would defeat the
>> point of the runtime check.
>
> Exactly.

Hmm. Perhaps we should check for the built-in functions, and if they're available, use them and not set NEEDS_RUNTIME_CHECK. If you compile with -msse4.2, the code won't run without SSE4.2 anyway (or at least isn't guaranteed to run).

- Heikki
At 2015-02-11 13:20:29 +0200, hlinnakangas@vmware.com wrote:
>
> I don't follow. I didn't change configure at all, compared to your
> patch.

OK, I extrapolated a little too much. Your patch didn't actually include crc_instructions.h; from the name of the #define, I imagined you planned to move the check to configure. But now I guess from your response that PG_HAVE_CRC32C_INSTRUCTIONS would be #defined by the header (as you did say it would be). So let's forget that part.

> As a case in point, with your patch pg_xlogdump would not have used
> the CRC instruction, because it never called pg_choose_crc_impl().
> "Choose on first use" is much more convenient than requiring every
> program to call a function.

I can see your point. I'm not fond of the code, but I guess it's not a huge deal in the larger scheme of things.

> I think what you're suggesting is that we'd instead have two mutually
> exclusive #defines, something like "USE_CRC_SSE_ALWAYS" and
> "USE_CRC_SSE_WITH_RUNTIME_CHECK".

I was thinking of only one knob: USE_CRC_SSE would (try to) build the code to use the SSE instructions without any test. If it (or another USE_CRC_XXX flag) isn't set, we'd build all applicable variants and test at runtime to decide what to use.

Just give me a little while to think through what that would look like, I'll post a patch.

-- Abhijit
On 02/11/2015 04:20 PM, Abhijit Menon-Sen wrote:
> At 2015-02-11 13:20:29 +0200, hlinnakangas@vmware.com wrote:
>>
>> I don't follow. I didn't change configure at all, compared to your
>> patch.
>
> OK, I extrapolated a little too much. Your patch didn't actually include
> crc_instructions.h;

Oh, I'm sorry. Here's the complete patch with crc_instructions.h

- Heikki
Attachment
On 02/12/2015 09:26 PM, Heikki Linnakangas wrote:
> On 02/11/2015 04:20 PM, Abhijit Menon-Sen wrote:
>> At 2015-02-11 13:20:29 +0200, hlinnakangas@vmware.com wrote:
>>>
>>> I don't follow. I didn't change configure at all, compared to your
>>> patch.
>>
>> OK, I extrapolated a little too much. Your patch didn't actually include
>> crc_instructions.h;
>
> Oh, I'm sorry. Here's the complete patch with crc_instructions.h

I was just about to commit the attached, which is the same as the previous patch with just cosmetic comment changes, but then I realized that this probably doesn't compile with Visual Studio 2005 or older. The code does "#ifdef _MSC_VER", and then uses the _mm_crc32_u64 intrinsic, but that intrinsic was added in Visual Studio 2008. I think we'll need a version check there. Or better yet, a direct configure test to check if the intrinsic exists - that way we get to also use it on Intel compilers, which I believe also have the same intrinsics. You want to write that or should I?

How do you like this latest version of the patch otherwise? You had some criticism earlier, but I had forgotten to include the crc_instructions.h header file in that earlier version.

- Heikki
Attachment
On 2015-03-25 19:18:51 +0200, Heikki Linnakangas wrote:
> I was just about to commit the attached, which is the same as the previous
> patch with just cosmetic comment changes, but then I realized that this
> probably doesn't compile with Visual Studio 2005 or older. The code does
> "#ifdef _MSC_VER", and then uses the _mm_crc32_u64 intrinsic, but that
> intrinsic was added in Visual Studio 2008. I think we'll need a version
> check there.

Good catch.

> Or better yet, a direct configure test to check if the
> intrinsic exists - that way we get to also use it on Intel compilers, which
> I believe also has the same intrinsics.

Maybe I'm missing something, but configure isn't run for msvc?

Greetings,

Andres Freund

--
Andres Freund	                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 03/25/2015 07:20 PM, Andres Freund wrote:
> On 2015-03-25 19:18:51 +0200, Heikki Linnakangas wrote:
>> Or better yet, a direct configure test to check if the
>> intrinsic exists - that way we get to also use it on Intel compilers,
>> which I believe also has the same intrinsics.
>
> Maybe I'm missing something, but configure isn't run for msvc?

Good point. On MSVC, we use the pre-built pg_config.h.win32 file instead. There are already a couple of cases like this in it:

    /* Define to 1 if you have the `rint' function. */
    #if (_MSC_VER >= 1800)
    #define HAVE_RINT 1
    #endif

I think we should do that for the CRC32 intrinsic too.

- Heikki
On 25/03/15 18:24, Heikki Linnakangas wrote:
> On 03/25/2015 07:20 PM, Andres Freund wrote:
>> On 2015-03-25 19:18:51 +0200, Heikki Linnakangas wrote:
>>> Or better yet, a direct configure test to check if the
>>> intrinsic exists - that way we get to also use it on Intel compilers,
>>> which I believe also has the same intrinsics.
>>
>> Maybe I'm missing something, but configure isn't run for msvc?
>
> Good point. On MSVC, we use the pre-built pg_config.h.win32 file
> instead. There are already a couple of cases like this in it:
>
> /* Define to 1 if you have the `rint' function. */
> #if (_MSC_VER >= 1800)
> #define HAVE_RINT 1
> #endif
>
> I think we should do that for the CRC32 intrinsic too.

Yeah, 1500 being MSVC 2008. But I think there is little point in putting it in pg_config.h.win32 when it's only going to be used in two places in a single header file. Do you plan to commit this otherwise as-is?

--
Petr Jelinek	                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
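[Editor's note: following the HAVE_RINT pattern quoted above, a compiler-version gate for the intrinsic might look like the fragment below. This is a sketch, not the committed code; the macro name is hypothetical, and per the thread _MSC_VER 1500 corresponds to Visual Studio 2008, the first release shipping the SSE4.2 intrinsics.]

```c
/*
 * Hypothetical fragment: gate the SSE4.2 CRC intrinsics on the compiler
 * version. _MSC_VER 1500 = Visual Studio 2008, which introduced
 * _mm_crc32_u8 / _mm_crc32_u64.
 */
#if defined(_MSC_VER) && (_MSC_VER >= 1500)
#include <nmmintrin.h>
#define HAVE_SSE42_CRC32_INTRINSICS 1
#endif
```

Whether this lives in pg_config.h.win32 or directly in the one header that uses it is exactly the question Petr raises above.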
At 2015-03-25 19:18:51 +0200, hlinnaka@iki.fi wrote:
>
> I think we'll need a version check there. […]
>
> You want to write that or should I?

I'm not familiar with MSVC at all, so it would be nice if you did it.

> How do you like this latest version of the patch otherwise?

I'm sorry, but I'm still not especially fond of it. Apart from removing the startup check so that client programs can also use the best available CRC32C implementation without jumping through hoops, I don't feel that the other changes buy us very much.

Also, assuming that the point is that people who don't care about CRCs deeply should nevertheless be able to produce special-purpose binaries with only the optimal implementation included, we should probably have some instructions about how to do that.

Thinking about what you were trying to do, I would find an arrangement roughly like the following to be clearer to follow in terms of adding new implementations and so on:

    #if defined(USE_CRC_SSE42) || …can build SSE4.2 CRC code…
    #define HAVE_CRC_SSE42 1
    pg_crc32c_sse42() { … }
    bool sse42_crc32c_available() { … }
    pg_comp_crc32c = pg_crc32c_sse42;
    #elif defined(USE_CRC_ARM) || …can build ARM CRC code…
    #define HAVE_CRC_ARM 1
    pg_crc32c_arm() { … }
    bool arm_crc32c_available() { … }
    pg_comp_crc32c = pg_crc32c_arm;
    #endif

    #define CRC_SELECTION 1
    #if defined(USE_CRC_SSE42) || defined(USE_CRC_ARM) || …
    #undef CRC_SELECTION
    #endif

    #ifdef CRC_SELECTION
    pg_crc32c_sb8() { … }
    pg_comp_crc32c_choose(…)
    {
        pg_comp_crc32c = pg_crc32c_sb8;
    #ifdef HAVE_CRC_SSE42
        if (sse42_crc32c_available())
            pg_comp_crc32c = pg_crc32c_sse42;
    #elif …
        …
    #endif
        return pg_comp_crc32c(…);
    }
    pg_comp_crc32c = pg_comp_crc32c_choose;
    #endif

Then someone who wants to force the building of (only) the SSE4.2 implementation can build with -DUSE_CRC_SSE42. And if you turn on USE_CRC_ARM when you can't build ARM code, it won't build. (The HAVE_CRC_xxx #defines could also move to configure tests.)
If you don't specify any USE_CRC_xxx explicitly, then it'll build whichever (probably) one it can, and select it at runtime if it's available.

All that said, I do recognise that these are all relatively cosmetic concerns, and I don't object seriously to any of it. On the contrary, thanks for taking the time to review and work on the patch. Nobody else has expressed an opinion, so I'll leave it to you to decide whether to commit as-is, or if you want me to pursue the above approach instead.

In the realm of very minor nitpicking, here are a couple of points I noticed in crc_instructions.h:

1. I think «if (pg_crc32_instructions_runtime_check())» would read better if the function were named crc32c_instructions_available or pg_crc32c_is_hw_accelerated, or something like that.

2. It documents pg_accumulate_crc32c_byte and pg_accumulate_crc32c_uint64, but actually defines pg_asm_crc32b and pg_asm_crc32q. If you update the code rather than the documentation, _update_ may be slightly preferable to _accumulate_, and maybe the suffixes should be _uint8 and _uint64.

3. The descriptions (e.g. "Add one byte to the current crc value.") should probably also read "Update the current crc value for the given byte/eight bytes of data".

Thanks.

-- Abhijit
On 04/02/2015 12:39 PM, Abhijit Menon-Sen wrote:
> At 2015-03-25 19:18:51 +0200, hlinnaka@iki.fi wrote:
>>
>> I think we'll need a version check there. […]
>>
>> You want to write that or should I?
>
> I'm not familiar with MSVC at all, so it would be nice if you did it.

Thinking more about the configure checks, I think the current approach of using inline assembly on gcc is not quite right. We're only using inline assembly to force producing SSE 4.2 code, even when -msse4.2 is not used. That feels wrong. And who's to say that the assembler supports the SSE instructions anyway? At least we'd need a configure check for that too. We have a buildfarm animal that still uses gcc 2.95.3, which was released in 2001. I don't have a compiler of that vintage to test with, but I assume an old enough assembler would not know about the crc32q instruction and fail to compile.

I believe the GCC way to do this would be to put the SSE4.2-specific code into a separate source file, and compile that file with "-msse4.2". And when you compile with -msse4.2, gcc actually also supports the _mm_crc32_u8/u64 intrinsics.

- Heikki
At 2015-04-02 17:58:23 +0300, hlinnaka@iki.fi wrote:
>
> We're only using inline assembly to force producing SSE 4.2 code, even
> when -msse4.2 is not used. That feels wrong.

Why? It feels OK to me (and to Andres, per earlier discussions about exactly this topic). Doing it this way allows the binary to run on a non-SSE4.2 platform (and not use the CRC instructions). Also, -msse4.2 was added to the compiler later than support for the instructions was added to the assembler.

> We have a buildfarm animal that still uses gcc 2.95.3, which was
> released in 2001. I don't have a compiler of that vintage to test
> with, but I assume an old enough assembler would not know about the
> crc32q instruction and fail to compile.

GCC from <2002 wouldn't support the symbolic operand names in inline assembly. binutils from <2007 (IIRC) wouldn't support the assembler instructions themselves. We could work around the latter by using the appropriate sequence of bytes. We could work around the former by using the old syntax for operands.

> I believe the GCC way to do this would be to put the SSE4.2-specific
> code into a separate source file, and compile that file with
> "-msse4.2". And when you compile with -msse4.2, gcc actually also
> supports the _mm_crc32_u8/u64 intrinsics.

I have no objection to this. Building only that file with -msse4.2 would resolve the problem of the output binary requiring SSE4.2; and the only compilers to be excluded are old enough to be uninteresting, at least to me personally.

Have you done/are you doing this already, or do you want me to? I could use advice on how to add build flags to only one file, since I don't know of any precedent for that.

-- Abhijit
On 2015-04-02 20:57:24 +0530, Abhijit Menon-Sen wrote:
> At 2015-04-02 17:58:23 +0300, hlinnaka@iki.fi wrote:
>>
>> We're only using inline assembly to force producing SSE 4.2 code, even
>> when -msse4.2 is not used. That feels wrong.
>
> Why? It feels OK to me (and to Andres, per earlier discussions about
> exactly this topic). Doing it this way allows the binary to run on a
> non-SSE4.2 platform (and not use the CRC instructions).

Right. And SSE4.2 isn't that widespread yet.

>> I believe the GCC way to do this would be to put the SSE4.2-specific
>> code into a separate source file, and compile that file with
>> "-msse4.2". And when you compile with -msse4.2, gcc actually also
>> supports the _mm_crc32_u8/u64 intrinsics.

To me this seems like a somewhat pointless exercise. I actually think from a performance POV it's better to have all the functions in one source file, so the compiler can inline things into the trampoline if it feels like it.

Greetings,

Andres Freund
On 04/02/2015 06:27 PM, Abhijit Menon-Sen wrote:
> At 2015-04-02 17:58:23 +0300, hlinnaka@iki.fi wrote:
>>
>> We're only using inline assembly to force producing SSE 4.2 code, even
>> when -msse4.2 is not used. That feels wrong.
>
> Why? It feels OK to me (and to Andres, per earlier discussions about
> exactly this topic). Doing it this way allows the binary to run on a
> non-SSE4.2 platform (and not use the CRC instructions).

Being able to run on non-SSE4.2 platforms is required, for sure.

> Also, -msse4.2 was added to the compiler later than support for the
> instructions was added to the assembler.

It was added in gcc 4.2. That's good enough for me.

>> We have a buildfarm animal that still uses gcc 2.95.3, which was
>> released in 2001. I don't have a compiler of that vintage to test
>> with, but I assume an old enough assembler would not know about the
>> crc32q instruction and fail to compile.
>
> GCC from <2002 wouldn't support the symbolic operand names in inline
> assembly. binutils from <2007 (IIRC) wouldn't support the assembler
> instructions themselves.
>
> We could work around the latter by using the appropriate sequence of
> bytes. We could work around the former by using the old syntax for
> operands.

I'm OK with not supporting the new instructions when building with an old compiler/assembler. But the build shouldn't fail with an old compiler/assembler. Using old syntax or raw bytes just to avoid failing on an ancient compiler seems ugly.

>> I believe the GCC way to do this would be to put the SSE4.2-specific
>> code into a separate source file, and compile that file with
>> "-msse4.2". And when you compile with -msse4.2, gcc actually also
>> supports the _mm_crc32_u8/u64 intrinsics.
>
> I have no objection to this.
>
> Building only that file with -msse4.2 would resolve the problem of the
> output binary requiring SSE4.2; and the only compilers to be excluded
> are old enough to be uninteresting at least to me personally.
>
> Have you done/are you doing this already, or do you want me to? I could
> use advice on how to add build flags to only one file, since I don't
> know of any precedent for that.

I came up with the attached. The SSE4.2-specific code is now in a separate file, in src/port/pg_crc32c_sse42.c. The slicing-by-8 implementation is moved to src/port/pg_crc32c_sb8.c, and the function to choose the implementation at runtime is in src/port/pg_crc32c_choose.c. How does this look to you?

BTW, we might want to move the "traditional" and "legacy" crc32 implementations out of src/common. They are only used in backend code now.

- Heikki
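[Editor's note: with the SSE4.2 code isolated in its own file, the function itself can use the intrinsics directly. The sketch below is hypothetical rather than the committed code: the function name is made up, and it uses GCC/Clang's per-function target attribute in place of the thread's per-file -msse4.2 so it is self-contained. It shows the n/8*(8 bytes) + n%8*(bytes) structure discussed earlier.]

```c
#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>			/* _mm_crc32_u8 / _mm_crc32_u64 */

/*
 * Process eight bytes at a time with crc32q, then the remainder bytewise.
 * The target attribute lets GCC/Clang emit SSE4.2 code for this one
 * function without compiling the whole translation unit with -msse4.2;
 * callers must still verify at runtime that the CPU supports SSE4.2.
 */
__attribute__((target("sse4.2")))
static uint32_t
pg_comp_crc32c_sse42(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t	crc64 = crc;

	while (len >= 8)
	{
		uint64_t	chunk;

		__builtin_memcpy(&chunk, p, 8); /* avoid unaligned dereference UB */
		crc64 = _mm_crc32_u64(crc64, chunk);
		p += 8;
		len -= 8;
	}
	crc = (uint32_t) crc64;
	while (len-- > 0)
		crc = _mm_crc32_u8(crc, *p++);
	return crc;
}
```

With init 0xFFFFFFFF and a final XOR with 0xFFFFFFFF, this computes standard CRC-32C (Castagnoli), the variant the crc32 instructions implement in hardware.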
Attachment
On Thu, Apr 2, 2015 at 5:33 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> It was added in gcc 4.2. That's good enough for me.

I think it's fine to have optional optimizations that require gcc >= 4.2, as long as older platforms don't break outright.

>>> We have a buildfarm animal that still uses gcc 2.95.3, which was
>>> released in 2001. I don't have a compiler of that vintage to test
>>> with, but I assume an old enough assembler would not know about the
>>> crc32q instruction and fail to compile.
>>
>> GCC from <2002 wouldn't support the symbolic operand names in inline
>> assembly. binutils from <2007 (IIRC) wouldn't support the assembler
>> instructions themselves.
>>
>> We could work around the latter by using the appropriate sequence of
>> bytes. We could work around the former by using the old syntax for
>> operands.
>
> I'm OK with not supporting the new instructions when building with an old
> compiler/assembler. But the build shouldn't fail with an old
> compiler/assembler. Using old syntax or raw bytes just to avoid failing on
> an ancient compiler seems ugly.

I dunno about old syntax, but raw bytes seems like a bad idea, for sure.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At 2015-04-03 00:33:10 +0300, hlinnaka@iki.fi wrote:
>
> I came up with the attached.

I like it very much.

src/port/Makefile has (note _src vs _srv):

    +# pg_crc32c_sse42.o and its _src.o version need CFLAGS_SSE42
    +pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42)
    +pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_SSE42)

Other than that, this looks great. Thank you.

> BTW, we might want to move the "traditional" and "legacy" crc32
> implementations out of src/common. They are only used in backend
> code now.

I agree.

-- Abhijit
On 04/03/2015 05:28 AM, Abhijit Menon-Sen wrote:
> At 2015-04-03 00:33:10 +0300, hlinnaka@iki.fi wrote:
>>
>> I came up with the attached.
>
> I like it very much.
>
> src/port/Makefile has (note _src vs _srv):
>
> +# pg_crc32c_sse42.o and its _src.o version need CFLAGS_SSE42
> +pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42)
> +pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_SSE42)
>
> Other than that, this looks great. Thank you.
>
>> BTW, we might want to move the "traditional" and "legacy" crc32
>> implementations out of src/common. They are only used in backend
>> code now.
>
> I agree.

Committed this now, after some further cleanup and reorganizing the autoconf stuff.

- Heikki