Thread: Substituting Checksum Algorithm (was: Enabling Checksums)
Starting a new thread to more narrowly address the topic. Attached is my reorganization of Ants's patch here: http://www.postgresql.org/message-id/CA +CSw_vinyf-w45i=M1m__MpJZY=e8S4Nt_KNnpEbtWjTOaSUA@mail.gmail.com My changes: * wrest the core FNV algorithm from the specific concerns of a data page - PageCalcChecksum16 mixes the block number, reduces to 16 bits, and avoids the pd_checksum field - the checksum algorithm is just a pure block checksum with a 32-bit result * moved the FNV checksum into a separate file, checksum.c * added Ants's suggested compilation flags for better optimization * slight update to the paragraph in the README that discusses concerns specific to a data page I do have a couple questions/concerns about Ants's patch though: * The README mentions a slight bias; does that come from the mod (2^16-1)? That's how I updated the README, so I wanted to make sure. * How was the FNV_PRIME chosen? * I can't match the individual algorithm step as described in the README to the actual code. And the comments in the README don't make it clear enough which one is right (or maybe they both are, and I'm just tired). The README says: hash = (hash ^ value) * ((hash ^ value) >> 3) (which is obviously missing the FNV_PRIME factor) and the code says: +#define CHECKSUM_COMP(checksum, value) do {\ + uint32 __tmp = (checksum) ^ (value);\ + (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 3);\ +} while (0) I'm somewhat on the fence about the "shift right". It was discussed in this part of the thread: http://www.postgresql.org/message-id/99343716-5F5A-45C8-B2F6-74B9BA357396@phlo.org I think we should be able to state with a little more clarity in the README why there is a problem with plain FNV-1a, and why this modification is both effective and safe. I'd lean toward simplicity and closer adherence to the published version of the algorithm rather than detecting a few more obscure error patterns. It looks like the modification slows down the algorithm, too. Regards, Jeff Davis
Attachment
On Apr23, 2013, at 09:17 , Jeff Davis <pgsql@j-davis.com> wrote: > I'd lean toward simplicity and closer adherence to the published version > of the algorithm rather than detecting a few more obscure error > patterns. It looks like the modification slows down the algorithm, too. The pattern that plain FNV1 misses are not that obscure, unfortunately. With plain FNV1A, the n-th bit of an input word (i.e. 32-bit block) only affects bits n through 31 of the checksum. In particular, the most significant bit of every 32-bit block only affects the MSB of the checksum, making the algorithm miss any even number of flipped MSBs. More generally, any form of data corruption that affects only the top N bits are missed at least once out of 2^N times, since changing only those bits cannot yield more than 2^N different checksum values. Such corruption pattern may not be especially likely, given that we're mainly protecting against disk corruption, not memory corruption. But quantifying how unlikely exactly seems hard, thus providing at least some form of protection against such errors seems prudent. In addition, even with the shift-induced slowdown, FNV1+SHIFT still performs similarly to hardware-accelerated CRC, reaching about 6bytes/cycle on modern Intel CPUS. This really is plenty fast - if I'm not mistaken, it translates to well over 10 GB/s. So overall -1 for removing the shift. best regards, Florian Pflug
<p dir="ltr">On Apr 23, 2013 10:17 AM, "Jeff Davis" <<a href="mailto:pgsql@j-davis.com">pgsql@j-davis.com</a>> wrote:<br/> > Attached is my reorganization of Ants's patch here:<br /> ><br /> > <a href="http://www.postgresql.org/message-id/CA">http://www.postgresql.org/message-id/CA</a><br/> > +CSw_vinyf-w45i=M1m__MpJZY=<a href="mailto:e8S4Nt_KNnpEbtWjTOaSUA@mail.gmail.com">e8S4Nt_KNnpEbtWjTOaSUA@mail.gmail.com</a><pdir="ltr">Thanks for yourhelp. Some notes below.<p dir="ltr">> My changes:<br /> ><br /> > * wrest the core FNV algorithm from the specificconcerns of a data page<br /> > - PageCalcChecksum16 mixes the block number, reduces to 16 bits,<br /> > and avoids the pd_checksum field<br /> > - the checksum algorithm is just a pure block checksum with a 32-bit<br/> > result<br /> > * moved the FNV checksum into a separate file, checksum.c<p dir="ltr">I think thefunction should not be called checksum_fnv as it implies that we use the well known straightforward implementation. Maybechecksum_block or some other generic name.<p dir="ltr">> * added Ants's suggested compilation flags for better optimization<pdir="ltr">-msse4.1 is not safe to use in default builds. On the other hand it doesn't hurt to just specifyit in CFLAGS for the whole compile (possibly as -march=native). We should just omit it and mention somewhere thatSSE4.1 enabled builds will have better checksum performance.<p dir="ltr">> * slight update to the paragraph in theREADME that discusses concerns<br /> > specific to a data page<br /> ><br /> > I do have a couple questions/concernsabout Ants's patch though:<br /> ><br /> > * The README mentions a slight bias; does that come fromthe mod<br /> > (2^16-1)? That's how I updated the README, so I wanted to make sure.<p dir="ltr">Yes.<p dir="ltr">>* How was the FNV_PRIME chosen?<p dir="ltr">I still haven't found the actual source for this value. It's specifiedas the value to use for 32bit FNV-1a.<p dir="ltr">> * I can't match the individual algorithm step as describedin the README<br /> > to the actual code. And the comments in the README don't make it clear<br /> > enoughwhich one is right (or maybe they both are, and I'm just tired).<p dir="ltr">I will try to reword.<p dir="ltr">>The README says:<br /> ><br /> > hash = (hash ^ value) * ((hash ^ value) >> 3)<br /> ><br/> > (which is obviously missing the FNV_PRIME factor) and the code says:<br /> ><br /> > +#define CHECKSUM_COMP(checksum,value) do {\<br /> > + uint32 __tmp = (checksum) ^ (value);\<br /> > + (checksum)= __tmp * FNV_PRIME ^ (__tmp >> 3);\<br /> > +} while (0)<br /> ><br /> > I'm somewhat on thefence about the "shift right". It was discussed in<br /> > this part of the thread:<br /> ><br /> > <a href="http://www.postgresql.org/message-id/99343716-5F5A-45C8-B2F6-74B9BA357396@phlo.org">http://www.postgresql.org/message-id/99343716-5F5A-45C8-B2F6-74B9BA357396@phlo.org</a><br />><br /> > I think we should be able to state with a little more clarity in the<br /> > README why there is a problemwith plain FNV-1a, and why this<br /> > modification is both effective and safe.<p dir="ltr">Florian already mentionedwhy it's effective. I have an intuition why it's safe, will try to come up with a well reasoned argument.<p dir="ltr">Regards,<br/> Antd Aasma
On 2013-04-23 00:17:28 -0700, Jeff Davis wrote: > + # important optimization flags for checksum.c > + ifeq ($(GCC),yes) > + checksum.o: CFLAGS += -msse4.1 -funroll-loops -ftree-vectorize > + endif I am pretty sure we can't do those unconditionally: - -funroll-loops and -ftree-vectorize weren't always part of gcc afair, so we would need a configure check for those - SSE4.1 looks like a total no-go, its not available everywhere. We *can* add runtime detection of that with gcc fairly easilyand one-time if we wan't to go there (later?) using 'ifunc's, but that needs a fair amount of infrastructure work. - We can rely on SSE1/2 on amd64, but I think thats automatically enabled there. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 23, 2013 at 11:47 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-04-23 00:17:28 -0700, Jeff Davis wrote: >> + # important optimization flags for checksum.c >> + ifeq ($(GCC),yes) >> + checksum.o: CFLAGS += -msse4.1 -funroll-loops -ftree-vectorize >> + endif > > I am pretty sure we can't do those unconditionally: > - -funroll-loops and -ftree-vectorize weren't always part of gcc afair, > so we would need a configure check for those -funroll-loops is available from at least GCC 2.95. -ftree-vectorize is GCC 4.0+. From what I read from the documentation on ICC -axSSE4.1 should generate a plain and accelerated version and do a runtime check., I don't know if ICC vectorizes the specific loop in the patch, but I would expect it to given that Intels vectorization has generally been better than GCCs and the loop is about as simple as it gets. I don't know the relevant options for other compilers. > - SSE4.1 looks like a total no-go, its not available everywhere. We > *can* add runtime detection of that with gcc fairly easily and > one-time if we wan't to go there (later?) using 'ifunc's, but that > needs a fair amount of infrastructure work. > - We can rely on SSE1/2 on amd64, but I think thats automatically > enabled there. This is why I initially went for the lower strength 16bit checksum calculation - requiring only SSE2 would have made supporting the vectorized version on amd64 trivial. By now my feeling is that it's not prudent to compromise in quality to save some infrastructure complexity. If we set a hypothetical VECTORIZATION_FLAGS variable at configure time, the performance is still there for those who need it and can afford CPU specific builds. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On Tue, 2013-04-23 at 11:44 +0300, Ants Aasma wrote: > I will try to reword. Did you have a chance to clarify this, as well as some of the other documentation issues Simon mentioned here? http://www.postgresql.org/message-id/CA+U5nMKVEu8UDXQe +Nk=d7Nqm4ypFSzaEf0esaK4j31LyQCOBQ@mail.gmail.com I'm not sure if others are waiting on me for a new patch or not. I can give the documentation issues a try, but I was hesitant to do so because you've done the research. The problems that I can correct are fairly trivial. Regards,Jeff Davis
<p dir="ltr">On Apr 25, 2013 10:38 PM, "Jeff Davis" <<a href="mailto:pgsql@j-davis.com">pgsql@j-davis.com</a>> wrote:<br/> ><br /> > On Tue, 2013-04-23 at 11:44 +0300, Ants Aasma wrote:<br /> > > I will try to reword.<br/> ><br /> > Did you have a chance to clarify this, as well as some of the other<br /> > documentationissues Simon mentioned here?<br /> ><br /> > <a href="http://www.postgresql.org/message-id/CA+U5nMKVEu8UDXQe">http://www.postgresql.org/message-id/CA+U5nMKVEu8UDXQe</a><br />> +Nk=<a href="mailto:d7Nqm4ypFSzaEf0esaK4j31LyQCOBQ@mail.gmail.com">d7Nqm4ypFSzaEf0esaK4j31LyQCOBQ@mail.gmail.com</a><p dir="ltr">Notyet. I am busy at the moment due to events in private life. I might be able to find some time over the weekendto go over the documentation but no guarantees.<p dir="ltr">> I'm not sure if others are waiting on me for a newpatch or not. I can<br /> > give the documentation issues a try, but I was hesitant to do so because<br /> > you'vedone the research.<br /> ><br /> > The problems that I can correct are fairly trivial.<p dir="ltr">The unresolvedcode issue that I know of is moving the compiler flags behind a configure check. I would greatly appreciate itif you could take a look at that. My config-fu is weak and it would take me some time to figure out how to do that.<p dir="ltr">Regards,<br/> Ants Aasma
On Apr26, 2013, at 10:28 , Ants Aasma <ants.aasma@eesti.ee> wrote: > On Apr 25, 2013 10:38 PM, "Jeff Davis" <pgsql@j-davis.com> wrote: > > > > On Tue, 2013-04-23 at 11:44 +0300, Ants Aasma wrote: > > > I will try to reword. > > > > Did you have a chance to clarify this, as well as some of the other > > documentation issues Simon mentioned here? > > > > http://www.postgresql.org/message-id/CA+U5nMKVEu8UDXQe > > +Nk=d7Nqm4ypFSzaEf0esaK4j31LyQCOBQ@mail.gmail.com > > Not yet. I am busy at the moment due to events in private life. I might be able to find some time over the weekend to goover the documentation but no guarantees. I can try to write up the reasoning behind the choice of FNV1+SHIFT3 as a checksum function, but I'm quite busy too so I'mnot 100% certain I'll get to it. If that's OK with you Ants, that is. > > I'm not sure if others are waiting on me for a new patch or not. I can > > give the documentation issues a try, but I was hesitant to do so because > > you've done the research. > > > > The problems that I can correct are fairly trivial. > > The unresolved code issue that I know of is moving the compiler flags behind a configure check. I would greatly appreciateit if you could take a look at that. My config-fu is weak and it would take me some time to figure out how to dothat. Do we necessarily have to do that before beta? If not, let's concentrate on getting the basic path in, and let's add thegcc-specific compiler options later. best regards, Florian Pflug
On 2013-04-26 13:11:00 +0200, Florian Pflug wrote: > > The unresolved code issue that I know of is moving the compiler flags behind a configure check. I would greatly appreciateit if you could take a look at that. My config-fu is weak and it would take me some time to figure out how to dothat. > > Do we necessarily have to do that before beta? If not, let's concentrate on getting the basic path in, and let's add thegcc-specific compiler options later. If we want them we should do it before beta, otherwise we won't notice problems that the flags cause (misoptimizations, problems on compiler versions, ...). So either now or in 9.4. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: >> On 2013-04-26 13:11:00 +0200, Florian Pflug wrote: >>> The unresolved code issue that I know of is moving the compiler flags behind a configure check. I would greatly appreciateit if you could take a look at that. My config-fu is weak and it would take me some time to figure out how to dothat. >> Do we necessarily have to do that before beta? If not, let's concentrate on getting the basic path in, and let's add thegcc-specific compiler options later. > If we want them we should do it before beta, otherwise we won't notice > problems that the flags cause (misoptimizations, problems on compiler > versions, ...). So either now or in 9.4. Yeah, beta1 is the point in the cycle where we have the most leverage for discovering portability problems. We should not be leaving anything involving configure tests as "to fix later". regards, tom lane
On 26 April 2013 14:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >>> On 2013-04-26 13:11:00 +0200, Florian Pflug wrote: >>>> The unresolved code issue that I know of is moving the compiler flags behind a configure check. I would greatly appreciateit if you could take a look at that. My config-fu is weak and it would take me some time to figure out how to dothat. > >>> Do we necessarily have to do that before beta? If not, let's concentrate on getting the basic path in, and let's addthe gcc-specific compiler options later. > >> If we want them we should do it before beta, otherwise we won't notice >> problems that the flags cause (misoptimizations, problems on compiler >> versions, ...). So either now or in 9.4. > > Yeah, beta1 is the point in the cycle where we have the most leverage > for discovering portability problems. We should not be leaving anything > involving configure tests as "to fix later". I'm expecting to spend some time on this over the weekend, once I've re-read the thread and patches to see if there is something to commit. That's my last time window, so this looks like the last chance to make changes before beta. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Apr 26, 2013 at 7:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I'm expecting to spend some time on this over the weekend, once I've > re-read the thread and patches to see if there is something to commit. > > That's my last time window, so this looks like the last chance to make > changes before beta. I updated the patch and split it into two parts (attached). The first patch is the checksum algorithm itself. I have done some documentation updates and moved it into the C file (rather than the README), but further explanation of the "shift right 3" modification will need to be by Ants or Florian. The second patch adds the configure-time check for the right compilation flags, and uses them when compiling checksum.c. I called the new variable CFLAGS_EXTRA, for lack of a better idea, so feel free to come up with a new name. It doesn't check for, or use, -msse4.1, but that can be specified by the user by configuring with CFLAGS_EXTRA="-msse4.1". I don't know of any more required changes, aside from documentation improvements. Regards, Jeff Davis
Attachment
On 4/26/13 3:57 PM, Jeff Davis wrote: > The second patch adds the configure-time check for the right > compilation flags, and uses them when compiling checksum.c. I > called the new variable CFLAGS_EXTRA, for lack of a better idea, > so feel free to come up with a new name. It doesn't check for, or > use, -msse4.1, but that can be specified by the user by > configuring with CFLAGS_EXTRA="-msse4.1". Thank you, that is the last piece I was looking at but couldn't nail down on my own. With that I should be able to duplicate both the slicing by 8 CRC speedup Ants sent over (which also expected some optimization changes) and trying something FNV based this weekend. I think I need to do two baselines: master without checksums, and master with extra optimizations but still without checksums. It may be the case that using better compile time optimizations gives a general speedup that's worth considering regardless. The optimizations seem to have a very significant impact on the checksum feature, but I'd like to quantify how they change the code a little bit before even getting into that. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Fri, 2013-04-26 at 16:40 -0400, Greg Smith wrote: > I think I need to do two baselines: master without checksums, and > master with extra optimizations but still without checksums. It may be > the case that using better compile time optimizations gives a general > speedup that's worth considering regardless. The optimizations seem to > have a very significant impact on the checksum feature, but I'd like to > quantify how they change the code a little bit before even getting into > that. The patch only affects optimization flags used when compiling checksum.c, so it should have no effect on other areas of the code. If you want to compile the whole source with those flags, then just do: CFLAGS="-msse4.1 -funroll-loops -ftree-vectorize" ./configure Changing the optimization flags for existing code will have a larger impact and should be considered separately from checksums. Regards,Jeff Davis
On 2013-04-26 12:57:09 -0700, Jeff Davis wrote: > I updated the patch and split it into two parts (attached). > The second patch adds the configure-time check for the right > compilation flags, and uses them when compiling checksum.c. I > called the new variable CFLAGS_EXTRA, for lack of a better idea, > so feel free to come up with a new name. It doesn't check for, or > use, -msse4.1, but that can be specified by the user by > configuring with CFLAGS_EXTRA="-msse4.1". CFLAGS_VECTORIZATION? EXTRA sounds to generic to me. > --- a/config/c-compiler.m4 > +++ b/config/c-compiler.m4 > @@ -242,6 +242,30 @@ undefine([Ac_cachevar])dnl > > > > +# PGAC_PROG_CC_CFLAGS_EXTRA_OPT > +# ----------------------- > +# Given a string, check if the compiler supports the string as a > +# command-line option. If it does, add the string to CFLAGS_EXTRA. > +AC_DEFUN([PGAC_PROG_CC_CFLAGS_EXTRA_OPT], > +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_prog_cc_cflags_extra_$1])])dnl > +AC_CACHE_CHECK([whether $CC supports $1], [Ac_cachevar], > +[pgac_save_CFLAGS_EXTRA=$CFLAGS_EXTRA > +CFLAGS_EXTRA="$pgac_save_CFLAGS_EXTRA $1" > +ac_save_c_werror_flag=$ac_c_werror_flag > +ac_c_werror_flag=yes > +_AC_COMPILE_IFELSE([AC_LANG_PROGRAM()], > + [Ac_cachevar=yes], > + [Ac_cachevar=no]) > +ac_c_werror_flag=$ac_save_c_werror_flag > +CFLAGS_EXTRA="$pgac_save_CFLAGS_EXTRA"]) > +if test x"$Ac_cachevar" = x"yes"; then > + CFLAGS_EXTRA="$CFLAGS_EXTRA $1" > +fi > +undefine([Ac_cachevar])dnl > +])# PGAC_PROG_CC_CFLAGS_EXTRA_OPT I think it would be better to have a PGAC_PROG_CC_VAR_OPT or so which assigns the flag to some passed variable name. Then we can reuse it for different vars and I have the feeling those will come. And having a CFLAGS_VECTOR_OPT would just be stupid ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 2013-04-27 at 00:20 +0200, Andres Freund wrote: > CFLAGS_VECTORIZATION? EXTRA sounds to generic to me. I went with CFLAGS_VECTOR to be a little shorter while still keeping some meaning. > I think it would be better to have a PGAC_PROG_CC_VAR_OPT or so which > assigns the flag to some passed variable name. Then we can reuse it for > different vars and I have the feeling those will come. And having a > CFLAGS_VECTOR_OPT would just be stupid ;) Good suggestion; done. Thank you for the review. New renamed patch attached for the build options only (the other patch for the FNV checksum algorithm is unchanged). Regards, Jeff Davis
Attachment
On Fri, Apr 26, 2013 at 10:57 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, Apr 26, 2013 at 7:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> I'm expecting to spend some time on this over the weekend, once I've >> re-read the thread and patches to see if there is something to commit. >> >> That's my last time window, so this looks like the last chance to make >> changes before beta. > > I updated the patch and split it into two parts (attached). > > The first patch is the checksum algorithm itself. I have done > some documentation updates and moved it into the C file (rather > than the README), but further explanation of the "shift right 3" > modification will need to be by Ants or Florian. > > The second patch adds the configure-time check for the right > compilation flags, and uses them when compiling checksum.c. I > called the new variable CFLAGS_EXTRA, for lack of a better idea, > so feel free to come up with a new name. It doesn't check for, or > use, -msse4.1, but that can be specified by the user by > configuring with CFLAGS_EXTRA="-msse4.1". > > I don't know of any more required changes, aside from > documentation improvements. I have updated the base patch. This is supposed to go under the cflags-vector patch that Jeff posted yesterday. I had the opportunity to run a few more tests of the hash. Based on the tests I switched the shift-right operation from 3 to 17bits (the original value was chosen by gut feel). Avalanche tests showed that this value removed bias the quickest. You can see the difference in the attached image, colors are still black 0% bias, blue 5%, green 33%, yellow 75%, red 100%. The final box in the diagram is covered by the final mixing iteration. The take away from these diagrams is 17 mixes better than 3. 17 still has some residual bias for the final iteration on the page. The effective information content values in checksum for 16 high order bits on final 32 32bit words on the page are: 16.0 15.1 14.1 13.3 12.6 12.1 11.8 11.7 11.5 11.5 11.1 11.0 10.9 10.6 10.6 10.4. Error detection capability for the highest bit is therefore 1:1351. Based on this I also switched to using two iterations of zeroes at the end, this way the lowest information content is 15.976bits or 1:64473. Documentation changes: * reworded the algorithm description so the order of steps is more apparent. * added a link to the FNV reference page. * fixed note about FNV being 4 bytes at a time. Official variant is 1 byte at a time. * added a segment of why the algorithm was chosen and its error detection capabilities. * added a segment about how the code affects vectorization. Issue to decide before commiting: * Whether to use 32 or 64 parallel checksums. The tradeoff for 64 is a small performance hit (10-20%) on todays CPUs for a small performance gain on Haswell processors coming to market this year and up to a theoretical 2x performance gain on future CPUs. Changing this is just a matter of changing N_SUMS and updating documentation to match. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Attachment
On 28 April 2013 13:22, Ants Aasma <ants@cybertec.at> wrote: > I have updated the base patch. This is supposed to go under the > cflags-vector patch that Jeff posted yesterday. I've committed your patch, with two changes * one comment extended * adding the .h file from Jeff's last main patch Please can Ants and Jeff review the commit and confirm that is what we wanted. Thanks. I'm about to light up the build farm with a trial commit of the compiler instructions stuff. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 30 April 2013 06:57, Simon Riggs <simon@2ndquadrant.com> wrote: > I'm about to light up the build farm with a trial commit of the > compiler instructions stuff. Amazingly that seemed to work. ISTM that we also need this patch to put memory barriers in place otherwise the code might be rearranged. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2013-04-30 11:55:29 +0100, Simon Riggs wrote: > ISTM that we also need this patch to put memory barriers in place > otherwise the code might be rearranged. > > -- > Simon Riggs http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > --- a/src/backend/storage/page/bufpage.c > +++ b/src/backend/storage/page/bufpage.c > @@ -960,11 +960,14 @@ PageCalcChecksum16(Page page, BlockNumber blkno) > * Save pd_checksum and set it to zero, so that the checksum calculation > * isn't affected by the checksum stored on the page. We do this to > * allow optimization of the checksum calculation on the whole block > - * in one go. > + * in one go. Memory barriers are required to avoid rearrangement here. > */ > save_checksum = phdr->pd_checksum; > + pg_memory_barrier(); > phdr->pd_checksum = 0; > + pg_memory_barrier(); > checksum = checksum_block(page, BLCKSZ); > + pg_memory_barrier(); > phdr->pd_checksum = save_checksum; > > /* mix in the block number to detect transposed pages */ Why? I am not sure which rearrangement you're fearing? In all cases where there is danger of concurrent write access to the page we should already be working on a copy? Also, if we need a memory barrier I can only see a point in the 2nd one. The first and third shouldn't ever be able to change anything since we are only writing to local memory? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 30, 2013 at 1:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 30 April 2013 06:57, Simon Riggs <simon@2ndquadrant.com> wrote: > >> I'm about to light up the build farm with a trial commit of the >> compiler instructions stuff. > > Amazingly that seemed to work. Thanks for committing. Sorry about missing the .h file from the patch. The two commits look good to me. I can confirm that compiling with CFLAGS="-O2 -march=native" will vectorize the committed code on GCC 4.7. I also checked the situation on clang. clang-3.2 isn't able to vectorize the loop even with vectorization options. I will check what is stopping it. If any volunteer has a working build setup with ICC or MSVC and is willing to run a couple of test compiles, I think we can achieve vectorization there too. > ISTM that we also need this patch to put memory barriers in place > otherwise the code might be rearranged. The compiler and CPU both have to preserve correctness when rearranging code, so I don't think we care about it here. It might matter if these routine could be called concurrently by multiple backends for a single buffer, but in that case memory barriers won't be enough, we'd need full exclusion. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On 30 April 2013 12:23, Ants Aasma <ants@cybertec.at> wrote: >> ISTM that we also need this patch to put memory barriers in place >> otherwise the code might be rearranged. > > The compiler and CPU both have to preserve correctness when > rearranging code, so I don't think we care about it here. It might > matter if these routine could be called concurrently by multiple > backends for a single buffer, but in that case memory barriers won't > be enough, we'd need full exclusion. Certainly happy not to commit anything else... Thanks to Ants and Andres. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > ISTM that we also need this patch to put memory barriers in place > otherwise the code might be rearranged. This is simply silly. regards, tom lane
On 30 April 2013 15:51, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> ISTM that we also need this patch to put memory barriers in place >> otherwise the code might be rearranged. > > This is simply silly. You crack me up sometimes. Yes, it is; seem to be having a bad day for thinkos. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
I re-ran the benchmark that's had me most worried against the committed code and things look good so far. I've been keeping quiet because my tests recently have all agreed with what Ants already described. This is more a confirmation summary than new data. The problem case has been Jeff's test 2 "worst-case overhead for calculating checksum while reading data" from the OS cache. I wrapped that into a test harness and gave results similar to Jeff's at http://www.postgresql.org/message-id/5133D732.4090801@2ndQuadrant.com based on the originally proposed Fletcher-16 checksum. I made some system improvements since then such that the absolute runtime improved for most of the tests I'm running. But the percentage changes didn't seem off enough to bother re-running the Fletcher tests again. Details are in attached spreadsheet, to summarize: -The original Fletcher-16 code slowed this test case down 24 to 32%, depending on whether you look at the average of 3 runs or the median. -The initial checksum commit with the truncated WAL CRC was almost an order of magnitude worse: 146% to 224% slowdown. The test case that took ~830ms was taking as much as 2652ms with that method. I'm still not sure why the first run of this test is always so much faster than the second and third. But since it happens so often I think it's fair to consider that worst case really important. -Committed FNV-1a implementation is now slightly better than Fletcher-16 speed wise: 19 to 27% slowdown. -Slicing by 8 CRC I didn't test because once I'd fully come around to agree with Ants's position it didn't seem likely to be useful. I don't want to lose track of that idea though, it might be the right path for a future implementation with 32 bit checksums. Since the >=25% slowdown on this test with Fletcher-16 turned into more like a 2% drop on more mixed workloads, I'd expect we're back to where that's again the case with the new FNV-1a. I plan to step back to looking at more of those cases, but it will take a few days at least to start sorting that out. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Attachment
On Tue, Apr 30, 2013 at 01:05:30PM -0400, Greg Smith wrote: > I re-ran the benchmark that's had me most worried against the > committed code and things look good so far. I've been keeping quiet > because my tests recently have all agreed with what Ants already > described. This is more a confirmation summary than new data. I came across this today: Data Integrity Extensions, basically a standard for have an application calculate a checksum of a block and submitting it together with the block so that the disk can verify that the block it is writing matches what the application sent. It appears SCSI has standardised on a CRC-16 checksum with polynomial 0x18bb7 . http://www.t10.org/ftp/t10/document.03/03-290r0.pdf https://oss.oracle.com/~mkp/docs/dix-draft.pdf Not directly relavent to PostgreSQL now, but possibly in the future. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
On 4/30/13 5:26 PM, Martijn van Oosterhout wrote: > I came across this today: Data Integrity Extensions, basically a > standard for have an application calculate a checksum of a block and > submitting it together with the block so that the disk can verify that > the block it is writing matches what the application sent. > > It appears SCSI has standardised on a CRC-16 checksum with polynomial > 0x18bb7 . To be pedantic for a minute (for the first time *ever* on pgsql-hackers) it's not quite all of SCSI. iSCSI has joined btrfs by settling on CRC-32C with the Castagnoli polynomial, as mentioned in that first reference. CRC-32C is also the one with the SSE4.2 instructions to help too. All the work around the T10/Data Integrity Field standard that's going on is nice. I think it's going to leave a lot of PostgreSQL users behind though. I'd bet a large sum of money that five years from now, there will still be more than 10X as many PostgreSQL servers on EC2 as on T10/DIF capable hardware. I feel pretty good that this new FNV-1a implementation is a good trade-off spot that balances error detection and performance impact. If you want a 16 bit checksum that seems ready for beta today, we can't do much better. Fletcher-16 had too many detection holes, the WAL checksum was way too expensive. Optimized FNV-1a is even better than unoptimized Fletcher-16 without as many detection issues. Can't even complain about the code bloat for this part either--checksum.c is only 68 lines if you take out its documentation. The WAL logging of hint bits is where the scary stuff to me for this feature has always been at. My gut feel is that doing that needed to start being available as an option anyway. Just this month we've had two customer issues pop up where we had to look for block differences between a master and a standby. The security update forced some normal update stragglers to where they now have the 9.1.6 index corruption fix, and we're looking for cases where standby indexes might have been corrupted by it. In this case the comparisons can just avoid anything but indexes, so hint bits are thankfully not involved. But having false positives pop out of comparing a master and standby due to hint bits makes this sort of process much harder in general. Being able to turn checksums on, and then compare more things between master and standby without expecting any block differences, that will make both routine quality auditing and forensics of broken clusters so much easier. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Orthogonal to this thread, but: On Tue, Apr 30, 2013 at 06:39:09PM -0400, Greg Smith wrote: > The WAL logging of hint bits is where the scary stuff to me for this > feature has always been at. My gut feel is that doing that needed to > start being available as an option anyway. Just this month we've had > two customer issues pop up where we had to look for block differences > between a master and a standby. The security update forced some normal > update stragglers to where they now have the 9.1.6 index corruption fix, > and we're looking for cases where standby indexes might have been > corrupted by it. In this case the comparisons can just avoid anything > but indexes, so hint bits are thankfully not involved. B-tree indexes have hints; see callers of ItemIdMarkDead(). -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 2013-04-30 18:39:09 -0400, Greg Smith wrote: > The WAL logging of hint bits is where the scary stuff to me for this feature > has always been at. My gut feel is that doing that needed to start being > available as an option anyway. Just this month we've had two customer > issues pop up where we had to look for block differences between a master > and a standby. The security update forced some normal update stragglers to > where they now have the 9.1.6 index corruption fix, and we're looking for > cases where standby indexes might have been corrupted by it. In this case > the comparisons can just avoid anything but indexes, so hint bits are > thankfully not involved. > > But having false positives pop out of comparing a master and standby due to > hint bits makes this sort of process much harder in general. Being able to > turn checksums on, and then compare more things between master and standby > without expecting any block differences, that will make both routine quality > auditing and forensics of broken clusters so much easier. I don't think the current implementation helps you with that. We only log the first hint bit set after a checkpoint, you will still get inconsistent bits set after that. So you might have some fewer inconsistencies but not enough to weed them out manually or such. c.f. MarkBufferDirtyHint() and XLogSaveBufferForHint(). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services