Thread: Re: Cost of XLogInsert CRC calculations
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > I didn't post the sources to the list originally as I wasn't sure if the > topic were of enough interest to warrant a larger email. I've attached the > two corrected programs as a .tar.gz - crctest.c uses uint32, whereas > crctest64.c uses uint64. I did some experimentation and concluded that gcc is screwing up big-time on optimizing the CRC64 code for 32-bit Intel. It does much better on every other architecture though. Here are some numbers with gcc 3.2.3 on an Intel Xeon machine. (I'm showing the median of three trials in each case, but the numbers were pretty repeatable. I also tried gcc 4.0.0 on this machine and got similar numbers.) gcc -O1 crctest.c 0.328571 s gcc -O2 crctest.c 0.297978 s gcc -O3 crctest.c 0.306894 s gcc -O1 crctest64.c 0.358263 s gcc -O2 crctest64.c 0.773544 s gcc -O3 crctest64.c 0.770945 s When -O2 is slower than -O1, you know the compiler is blowing it :-(. I fooled around with non-default -march settings but didn't see much change. Similar tests on a several-year-old Pentium 4 machine, this time with gcc version 3.4.3: gcc -O1 -march=pentium4 crctest.c 0.486266 s gcc -O2 -march=pentium4 crctest.c 0.520237 s gcc -O3 -march=pentium4 crctest.c 0.520299 s gcc -O1 -march=pentium4 crctest64.c 0.928107 s gcc -O2 -march=pentium4 crctest64.c 1.247673 s gcc -O3 -march=pentium4 crctest64.c 1.654102 s Here are some comparisons showing that the performance difference is not inherent: IA64 (Itanium 2), gcc 3.2.3: gcc -O1 crctest.c 0.898595 s gcc -O2 crctest.c 0.599005 s gcc -O3 crctest.c 0.598824 s gcc -O1 crctest64.c 0.524257 s gcc -O2 crctest64.c 0.524168 s gcc -O3 crctest64.c 0.524140 s X86_64 (Opteron), gcc 3.2.3: gcc -O1 crctest.c 0.460000 s gcc -O2 crctest.c 0.460000 s gcc -O3 crctest.c 0.460000 s gcc -O1 crctest64.c 0.410000 s gcc -O2 crctest64.c 0.410000 s gcc -O3 crctest64.c 0.410000 s PPC64 (IBM POWER4+), gcc 3.2.3 gcc -O1 crctest.c 0.819492 s gcc -O2 crctest.c 0.819427 s gcc -O3 crctest.c 0.820616 s gcc -O1 crctest64.c 0.751639 s gcc -O2 crctest64.c 0.894250 s gcc -O3 crctest64.c 0.888959 s PPC (Mac G4), gcc 3.3 gcc -O1 crctest.c 0.949094 s gcc -O2 crctest.c 1.011220 s gcc -O3 crctest.c 1.013847 s gcc -O1 crctest64.c 1.314093 s gcc -O2 crctest64.c 1.015367 s gcc -O3 crctest64.c 1.011468 s HPPA, gcc 2.95.3: gcc -O1 crctest.c 1.796604 s gcc -O2 crctest.c 1.676023 s gcc -O3 crctest.c 1.676476 s gcc -O1 crctest64.c 2.022798 s gcc -O2 crctest64.c 1.916185 s gcc -O3 crctest64.c 1.904094 s Given the lack of impressive advantage to the 64-bit code even on 64-bit architectures, it might be best to go with the 32-bit code everywhere, but I also think we have grounds to file a gcc bug report. Anyone want to try it with non-gcc compilers? I attach a slightly cleaned-up version of Mark's original (doesn't draw compiler warnings or errors on what I tried it on). regards, tom lane
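[The crctest.c / crctest64.c attachments are not reproduced in this
archive.  Purely as an illustration of the kind of micro-benchmark under
discussion -- the buffer size, run count, data fill and output format
below are guesses, not the code that was actually attached -- a
table-driven CRC32 timing harness might look like this:]

    /*
     * Hypothetical sketch of a CRC timing harness in the spirit of
     * crctest.c; not the program attached to this thread.  Uses the
     * standard CRC-32 generator polynomial 0x04C11DB7 (the one Manfred
     * identifies later in the thread).
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <sys/time.h>

    #define BUFSIZE 8192
    #define NRUNS   10000

    static uint32_t crc_table[256];
    static unsigned char buf[BUFSIZE];

    static void
    init_crc_table(void)
    {
        int     i, j;

        for (i = 0; i < 256; i++)
        {
            uint32_t    crc = (uint32_t) i << 24;

            for (j = 0; j < 8; j++)
                crc = (crc & 0x80000000U) ? (crc << 1) ^ 0x04C11DB7U : (crc << 1);
            crc_table[i] = crc;
        }
    }

    static uint32_t
    crc32_buf(const unsigned char *data, int len)
    {
        uint32_t    crc = 0xFFFFFFFFU;

        while (len-- > 0)
        {
            int     tab_index = ((int) (crc >> 24) ^ *data++) & 0xFF;

            crc = crc_table[tab_index] ^ (crc << 8);
        }
        return crc;
    }

    int
    main(void)
    {
        struct timeval start, stop;
        uint32_t    crc = 0;
        int         i;

        init_crc_table();
        for (i = 0; i < BUFSIZE; i++)
            buf[i] = (unsigned char) rand();    /* arbitrary test data */

        gettimeofday(&start, NULL);
        for (i = 0; i < NRUNS; i++)
            crc = crc32_buf(buf, BUFSIZE);
        gettimeofday(&stop, NULL);

        printf("Result of CRC32 (%d runs): %08x in time %f s\n",
               NRUNS, (unsigned) crc,
               (stop.tv_sec - start.tv_sec) +
               (stop.tv_usec - start.tv_usec) / 1000000.0);
        return 0;
    }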
Hi Tom,

I didn't post the sources to the list originally as I wasn't sure if the
topic were of enough interest to warrant a larger email. I've attached the
two corrected programs as a .tar.gz - crctest.c uses uint32, whereas
crctest64.c uses uint64.

Kind regards,

Mark.

------------------------
WebBased Ltd
17 Research Way
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 16 May 2005 15:01
> To: Mark Cave-Ayland (External)
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > Sigh, so it would help if I had added the offset to the data
> > pointer... ;)
>
> Would you post the corrected program so people can try it on
> a few other architectures?  No point in reinventing the
> wheel, even if it is a pretty trivial wheel.
>
> 			regards, tom lane
I tested crctest on two machines and two versions of gcc.

UltraSPARC III, gcc 2.95.3:
gcc -O1 crctest.c       1.321517 s
gcc -O2 crctest.c       1.099186 s
gcc -O3 crctest.c       1.099330 s
gcc -O1 crctest64.c     1.651599 s
gcc -O2 crctest64.c     1.429089 s
gcc -O3 crctest64.c     1.434296 s

UltraSPARC III, gcc 3.4.3:
gcc -O1 crctest.c       1.209168 s
gcc -O2 crctest.c       1.206253 s
gcc -O3 crctest.c       1.209762 s
gcc -O1 crctest64.c     1.545899 s
gcc -O2 crctest64.c     1.545290 s
gcc -O3 crctest64.c     1.540993 s

Pentium III, gcc 2.95.3:
gcc -O1 crctest.c       1.548432 s
gcc -O2 crctest.c       1.226873 s
gcc -O3 crctest.c       1.227699 s
gcc -O1 crctest64.c     1.362152 s
gcc -O2 crctest64.c     1.259324 s
gcc -O3 crctest64.c     1.259608 s

Pentium III, gcc 3.4.3:
gcc -O1 crctest.c       1.084822 s
gcc -O2 crctest.c       0.921594 s
gcc -O3 crctest.c       0.921910 s
gcc -O1 crctest64.c     1.188287 s
gcc -O2 crctest64.c     1.242013 s
gcc -O3 crctest64.c     1.638812 s

I think the performance can be improved by loop unrolling.  I measured the
performance with the loop unrolled by the -funroll-loops option or by
hand-tuning.  (The hand-tuned version is attached.)

UltraSPARC III, gcc 2.95.3:
gcc -O2 crctest.c                   1.098880 s
gcc -O2 -funroll-loops crctest.c    0.874165 s
gcc -O2 crctest_unroll.c            0.808208 s

UltraSPARC III, gcc 3.4.3:
gcc -O2 crctest.c                   1.209168 s
gcc -O2 -funroll-loops crctest.c    1.127973 s
gcc -O2 crctest_unroll.c            1.017485 s

Pentium III, gcc 2.95.3:
gcc -O2 crctest.c                   1.226873 s
gcc -O2 -funroll-loops crctest.c    1.077475 s
gcc -O2 crctest_unroll.c            1.051375 s

Pentium III, gcc 3.4.3:
gcc -O2 crctest.c                   0.921594 s
gcc -O2 -funroll-loops crctest.c    0.873614 s
gcc -O2 crctest_unroll.c            0.839384 s

regards,

--- Atsushi Ogawa
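[The attached crctest_unroll.c is likewise not reproduced in this archive.
As a sketch only -- not Atsushi Ogawa's actual hand-tuned code -- unrolling
the table-driven loop four bytes per iteration looks something like this,
reusing the 256-entry crc_table assumed in the earlier sketch:]

    /*
     * Illustrative hand-unrolled CRC32 inner loop: four table lookups per
     * iteration, with a trailing loop for any remaining bytes.  This is
     * the general shape of the optimization being measured, not the code
     * that was attached.
     */
    #include <stdint.h>
    #include <stddef.h>

    extern uint32_t crc_table[256];     /* same table as in the earlier sketch */

    static uint32_t
    crc32_buf_unrolled(const unsigned char *data, size_t len)
    {
        uint32_t    crc = 0xFFFFFFFFU;

        while (len >= 4)
        {
            crc = crc_table[((crc >> 24) ^ data[0]) & 0xFF] ^ (crc << 8);
            crc = crc_table[((crc >> 24) ^ data[1]) & 0xFF] ^ (crc << 8);
            crc = crc_table[((crc >> 24) ^ data[2]) & 0xFF] ^ (crc << 8);
            crc = crc_table[((crc >> 24) ^ data[3]) & 0xFF] ^ (crc << 8);
            data += 4;
            len -= 4;
        }
        while (len-- > 0)
            crc = crc_table[((crc >> 24) ^ *data++) & 0xFF] ^ (crc << 8);
        return crc;
    }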
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 16 May 2005 17:36
> To: Mark Cave-Ayland (External)
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> I did some experimentation and concluded that gcc is screwing
> up big-time on optimizing the CRC64 code for 32-bit Intel.
> It does much better on every other architecture though.

Hi Tom,

Thanks very much for showing that the uint64 slowdown was caused by the
optimisation done by gcc - I've had a go at filing a bug with the gcc
people at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21617 so it would be
interesting to see if they can solve this. Perhaps, like you suggest, the
short term solution is to use the uint32 CRC64 code everywhere at the
moment.

I hope to find some time later this week to write and test some CRC32
routines, and will post the results back to the list.

Many thanks,

Mark.

------------------------
WebBased Ltd
17 Research Way
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
On Mon, 16 May 2005, Tom Lane wrote:

> I did some experimentation and concluded that gcc is screwing up
> big-time on optimizing the CRC64 code for 32-bit Intel.  It does much
> better on every other architecture though.
>
> Anyone want to try it with non-gcc compilers?

Solaris 9 x86 - Sun Workshop 6 update 2 C 5.3, gcc 3.2.3

gcc -O1 crctest.c       .251422
gcc -O3 crctest.c       .240223
gcc -O1 crctest64.c     .281369
gcc -O3 crctest64.c     .631290

cc -O crctest.c         .268905
cc -fast crctest.c      .242429
cc -O crctest64.c       .283278
cc -fast crctest64.c    .255560

Kris Jurka
On Mon, 16 May 2005 12:35:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Anyone want to try it with non-gcc compilers?

MS VC++ 6.0 with various predefined optimizer settings

                             2x32       64
Default (without any /O)   0.828125  0.906250
MinSize (contains /O1)     0.468750  0.593750
MaxSpeed (contains /O2)    0.312500  0.640625

Not that it really matters -- but at least this looks like another hint
that letting the compiler emulate 64 bit operations on non 64 bit hardware
is suboptimal.

Servus
Manfred
Manfred Koizar wrote:
> On Mon, 16 May 2005 12:35:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >Anyone want to try it with non-gcc compilers?
>
> MS VC++ 6.0 with various predefined optimizer settings
>
>                              2x32       64
> Default (without any /O)   0.828125  0.906250
> MinSize (contains /O1)     0.468750  0.593750
> MaxSpeed (contains /O2)    0.312500  0.640625
>
> Not that it really matters -- but at least this looks like another hint
> that letting the compiler emulate 64 bit operations on non 64 bit
> hardware is suboptimal.

I don't understand why we are testing 64-bit CRC when I thought we agreed
that 32-bit was good enough for our purposes.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I don't understand why we are testing 64-bit CRC when I thought we
> agreed that 32-bit was good enough for our purposes.

Well, we need to understand exactly what is going on here.  I'd not like
to think that we dropped back from 64 to 32 bit because of one
possibly-minor optimization bug in one compiler on one platform.  Even if
that compiler+platform is 90% of the market.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I don't understand why we are testing 64-bit CRC when I thought we
> > agreed that 32-bit was good enough for our purposes.
>
> Well, we need to understand exactly what is going on here.  I'd not
> like to think that we dropped back from 64 to 32 bit because of one
> possibly-minor optimization bug in one compiler on one platform.
> Even if that compiler+platform is 90% of the market.

But isn't it obvious that almost any problem that CRC64 is going to catch,
CRC32 is going to catch, and we know CRC32 has to be faster than CRC64?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Well, we need to understand exactly what is going on here.  I'd not
>> like to think that we dropped back from 64 to 32 bit because of one
>> possibly-minor optimization bug in one compiler on one platform.
>> Even if that compiler+platform is 90% of the market.

> But isn't it obvious that almost any problem that CRC64 is going to
> catch, CRC32 is going to catch, and we know CRC32 has to be faster than
> CRC64?

Do we know that?  The results I showed put at least one fundamentally
32bit platform (the PowerBook I'm typing this on) at dead par for 32bit
and 64bit CRCs.  We have also established that 64bit CRC can be done much
faster on 32bit Intel than it's currently done by the default PG-on-gcc
build (hint: don't use -O2 or above).  So while Mark's report that 64bit
CRC is an issue on Intel is certainly true, it doesn't immediately follow
that the only sane response is to give up 64bit CRC.  We need to study it
and see what alternatives we have.

I do personally feel that 32bit is the way to go, but that doesn't mean I
think it's a done deal.  We owe it to ourselves to understand what we are
buying and what we are paying for it.

			regards, tom lane
I probably shouldn't jump in, because I do not know the nature of the usage
of the CRC values.

But if the birthday paradox can come into play, with a 32 bit CRC, you will
get one false mismatch every 78,643 items or so.
http://mathworld.wolfram.com/BirthdayProblem.html

Probably you already knew that, and probably the birthday paradox does not
apply.

I generally use 64 bit CRCs (UMAC) for just about anything that needs a CRC.
http://www.cs.ucdavis.edu/~rogaway/umac/

A plausible work-around is to compute two distinct 32-bit hash values for
platforms with awful 64 bit math/emulation (e.g. [SDBM hash and FNV hash]
or [Bob Jenkins hash and D. J. Bernstein hash]) to create two distinct 32
bit hash values -- both of which must match.

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Tuesday, May 17, 2005 9:26 PM
> To: Bruce Momjian
> Cc: Manfred Koizar; Mark Cave-Ayland; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations
>
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Well, we need to understand exactly what is going on here.  I'd not
> >> like to think that we dropped back from 64 to 32 bit because of one
> >> possibly-minor optimization bug in one compiler on one platform.
> >> Even if that compiler+platform is 90% of the market.
>
> > But isn't it obvious that almost any problem that CRC64 is going to
> > catch, CRC32 is going to catch, and we know CRC32 has to be faster than
> > CRC64?
>
> Do we know that?  The results I showed put at least one fundamentally
> 32bit platform (the PowerBook I'm typing this on) at dead par for 32bit
> and 64bit CRCs.  We have also established that 64bit CRC can be done
> much faster on 32bit Intel than it's currently done by the default
> PG-on-gcc build (hint: don't use -O2 or above).  So while Mark's report
> that 64bit CRC is an issue on Intel is certainly true, it doesn't
> immediately follow that the only sane response is to give up 64bit CRC.
> We need to study it and see what alternatives we have.
>
> I do personally feel that 32bit is the way to go, but that doesn't
> mean I think it's a done deal.  We owe it to ourselves to understand
> what we are buying and what we are paying for it.
>
> 			regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>        joining column's datatypes do not match
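[A sketch of the two-independent-hash work-around Dann describes, using
the textbook FNV-1a and D. J. Bernstein (djb2) hash functions he names as
examples.  This is illustrative only; it is not code from PostgreSQL or
from this thread:]

    /*
     * Two independent 32-bit hashes computed over the same data, both of
     * which would have to match for the record to be accepted.
     */
    #include <stdint.h>
    #include <stddef.h>

    static uint32_t
    fnv1a_32(const unsigned char *data, size_t len)
    {
        uint32_t    hash = 2166136261u;     /* FNV offset basis */

        while (len-- > 0)
        {
            hash ^= *data++;
            hash *= 16777619u;              /* FNV prime */
        }
        return hash;
    }

    static uint32_t
    djb2_32(const unsigned char *data, size_t len)
    {
        uint32_t    hash = 5381;

        while (len-- > 0)
            hash = hash * 33 + *data++;
        return hash;
    }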
Tom Lane <tgl@sss.pgh.pa.us> writes: > Do we know that? The results I showed put at least one fundamentally > 32bit platform (the PowerBook I'm typing this on) at dead par for 32bit > and 64bit CRCs. Wait, par for 32-bit CRCs? Or for 64-bit CRCs calculated using 32-bit ints? -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Do we know that? The results I showed put at least one fundamentally >> 32bit platform (the PowerBook I'm typing this on) at dead par for 32bit >> and 64bit CRCs. > Wait, par for 32-bit CRCs? Or for 64-bit CRCs calculated using 32-bit ints? Right, the latter. We haven't actually tried to measure the cost of plain 32bit CRCs... although I seem to recall that when we originally decided to use 64bit, someone put up some benchmarks purporting to show that there wasn't much difference. regards, tom lane
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> Do we know that?  The results I showed put at least one fundamentally
> >> 32bit platform (the PowerBook I'm typing this on) at dead par for 32bit
> >> and 64bit CRCs.
>
> > Wait, par for 32-bit CRCs? Or for 64-bit CRCs calculated using 32-bit ints?
>
> Right, the latter.  We haven't actually tried to measure the cost of
> plain 32bit CRCs... although I seem to recall that when we originally
> decided to use 64bit, someone put up some benchmarks purporting to
> show that there wasn't much difference.

OK, thanks. I didn't know that.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
"Dann Corbit" <DCorbit@connx.com> writes: > Probably you already knew that, and probably the birthday paradox does > not apply. > > I generally use 64 bit CRCs (UMAC) for just about anything that needs a > CRC. > http://www.cs.ucdavis.edu/~rogaway/umac/ The birthday paradox doesn't come up here. The CRC has to match the actual data for that specific xlog, not just any CRC match with any xlog from a large list. So if an xlog is corrupted or truncated then the chances that a 64-bit CRC would match and the xlog be mishandled is one in 16 quadrillion or so. A 32-bit CRC will match the invalid data is about one in 4 billion. The chances that a 16-bit CRC would match would be one in 64 thousand. I mention 16-bit CRC because you use a system every day that uses 16-bit CRCs and you trust thousands of data blocks each day to this protection (actually probably thousands each second). I refer to TCP/IP. Every TCP/IP segment is protected by just a 16-bit CRC. Have you ever seen a corrupted TCP stream caused by the use of such a short CRC? Actually I have. A router with a bad memory card caused about 1% packet loss due to corrupted segments. Low enough not to be noticed but in a large FTP transfer it meant about one corrupted packet got through every 2.4GB of data or so. Now consider the data being protected by a the xlog CRC. If 1% of every disk write were being corrupted would one incorrect xlog being read in and mishandled about once every few gigabytes of logs really be the worst of your worries? More realistically, if you were losing power frequently and having truncated xlog writes frequently, say about once every 5 minutes (if you could get it to boot that fast). Would one incorrectly handled truncated log every 56 days be considered unacceptable? That would be the consequence of 16-bit checksums. If you ran the same experiment with 32-bit checksums it would mean the database wouldn't correctly replay once every two thousand five hundred and fifty three years. -- greg
On E, 2005-05-16 at 12:35 -0400, Tom Lane wrote:
> Given the lack of impressive advantage to the 64-bit code even on 64-bit
> architectures, it might be best to go with the 32-bit code everywhere,
> but I also think we have grounds to file a gcc bug report.

Maybe on other platforms, but 20% on Power is not something we should
throw away.

crc32 compiled as a 32bit executable is 10% slower than crc64 as either a
32 or 64 bit exe, but if you compile your backend as 64bit then the
difference is almost 20%.  crc64 is the same speed compiled either way.

gcc version 3.4.3 20041212 (Red Hat 3.4.3-9.EL4) on OpenPower5 1.8GHz

file ./crctest
./crctest: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV)
cc -O1 crctest.c -o crctest             -- time 0.584327 s
cc -O2 crctest.c -o crctest             -- time 0.594664 s
cc -O3 crctest.c -o crctest             -- time 0.594764 s

file ./crctest
./crctest: ELF 64-bit MSB executable, cisco 7500, version 1 (SYSV)
cc -O1 -m64 crctest.c -o crctest        -- time 0.644473 s
cc -O2 -m64 crctest.c -o crctest        -- time 0.648033 s
cc -O3 -m64 crctest.c -o crctest        -- time 0.688682 s

file ./crctest64
./crctest64: ELF 64-bit MSB executable, cisco 7500, version 1 (SYSV)
cc -O1 -m64 crctest64.c -o crctest64    -- time 0.545026 s
cc -O2 -m64 crctest64.c -o crctest64    -- time 0.545470 s
cc -O3 -m64 crctest64.c -o crctest64    -- time 0.545037 s

file ./crctest64
./crctest64: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV)
cc -O1 crctest64.c -o crctest64         -- time 0.545364 s
cc -O2 crctest64.c -o crctest64         -- time 0.644093 s
cc -O3 crctest64.c -o crctest64         -- time 0.644155 s

> Anyone want to try it with non-gcc compilers?  I attach a slightly
> cleaned-up version of Mark's original (doesn't draw compiler warnings
> or errors on what I tried it on).

I'll probably get a chance to try IBM's own compiler tomorrow.

--
Hannu Krosing <hannu@skype.net>
On T, 2005-05-17 at 22:37 -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I don't understand why we are testing 64-bit CRC when I thought we
> > agreed that 32-bit was good enough for our purposes.
>
> Well, we need to understand exactly what is going on here.  I'd not
> like to think that we dropped back from 64 to 32 bit because of one
> possibly-minor optimization bug in one compiler on one platform.
> Even if that compiler+platform is 90% of the market.

There are cases where 32bit is about 20% slower.

I tried to send the following yesterday, but for some reason the mails I
send from home where To: is Tom Lane get errors from
"RCPT:tgl@sss.pgh.pa.us" and fail to go through to other destinations
(like pgsql-hackers) after that :(

-----

crc32 compiled as a 32bit executable is 10% slower than crc64 as either a
32 or 64 bit exe, but if you compile your backend as 64bit then the
difference is almost 20%.  crc64 is the same speed compiled either way.

gcc version 3.4.3 20041212 (Red Hat 3.4.3-9.EL4) on OpenPower5 1.8GHz

file ./crctest
./crctest: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV)
cc -O1 crctest.c -o crctest             -- time 0.584327 s
cc -O2 crctest.c -o crctest             -- time 0.594664 s
cc -O3 crctest.c -o crctest             -- time 0.594764 s

file ./crctest
./crctest: ELF 64-bit MSB executable, cisco 7500, version 1 (SYSV)
cc -O1 -m64 crctest.c -o crctest        -- time 0.644473 s
cc -O2 -m64 crctest.c -o crctest        -- time 0.648033 s
cc -O3 -m64 crctest.c -o crctest        -- time 0.688682 s

file ./crctest64
./crctest64: ELF 64-bit MSB executable, cisco 7500, version 1 (SYSV)
cc -O1 -m64 crctest64.c -o crctest64    -- time 0.545026 s
cc -O2 -m64 crctest64.c -o crctest64    -- time 0.545470 s
cc -O3 -m64 crctest64.c -o crctest64    -- time 0.545037 s

file ./crctest64
./crctest64: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV)
cc -O1 crctest64.c -o crctest64         -- time 0.545364 s
cc -O2 crctest64.c -o crctest64         -- time 0.644093 s
cc -O3 crctest64.c -o crctest64         -- time 0.644155 s

--
Hannu Krosing <hannu@skype.net>
On K, 2005-05-18 at 10:24 +0300, Hannu Krosing wrote:
> On T, 2005-05-17 at 22:37 -0400, Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > I don't understand why we are testing 64-bit CRC when I thought we
> > > agreed that 32-bit was good enough for our purposes.
> >
> > Well, we need to understand exactly what is going on here.  I'd not
> > like to think that we dropped back from 64 to 32 bit because of one
> > possibly-minor optimization bug in one compiler on one platform.
> > Even if that compiler+platform is 90% of the market.
>
> There are cases where 32bit is about 20% slower.
>
> I tried to send the following yesterday, but for some reason the mails I
> send from home where To: is Tom Lane get errors from
> "RCPT:tgl@sss.pgh.pa.us" and fail to go through to other destinations
> (like pgsql-hackers) after that :(
> -----

The same difference between 32bit and 64bit CRC when compiled as a 64bit
exe is there also with IBM's own compiler (vac xlc 7.0):

[hannu@power ~]$ c99 -O5 -qarch=pwr5 -qtune=pwr5 -qsmp=omp -qunroll=yes -q64 crctest64.c -o crctest64_c99_64
[hannu@power ~]$ ./crctest64_c99_64
Result of CRC64 (10000 runs): 782104059a01660 in time 0.545042 s

[hannu@power ~]$ c99 -O5 -qarch=pwr5 -qtune=pwr5 -qsmp=omp -qunroll=yes -q64 crctest.c -o crctest_c99_64
[hannu@power ~]$ ./crctest_c99_64
Result of CRC64 (10000 runs): 7821040 (high), 59a01660 (low) in time 0.644319 s

--
Hannu Krosing <hannu@skype.net>
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 18 May 2005 06:12
> To: Greg Stark
> Cc: Bruce Momjian; Manfred Koizar; Mark Cave-Ayland
> (External); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations
>
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> Do we know that?  The results I showed put at least one fundamentally
> >> 32bit platform (the PowerBook I'm typing this on) at dead par for
> >> 32bit and 64bit CRCs.
>
> > Wait, par for 32-bit CRCs? Or for 64-bit CRCs calculated using 32-bit
> > ints?
>
> Right, the latter.  We haven't actually tried to measure the
> cost of plain 32bit CRCs... although I seem to recall that
> when we originally decided to use 64bit, someone put up some
> benchmarks purporting to show that there wasn't much difference.

If all goes to plan, I shall post a test program for CRC32 tomorrow along
with my results for other people to benchmark.  The one thing this
exercise has shown me is that you can't necessarily trust the compiler to
make the right optimisation choices 100% of the time, and so the only real
way to find out about the performance is to test what you are trying to do
with some sample data.  To quote Tom on several occasions: "Let's stop the
hand waving...."

Looking at the asm code produced by gcc, my feeling is that most of the
time would be spent reading/writing to memory rather than doing the shift
and xor.  Using a 64-bit CRC, there aren't enough registers under x86-32
to store the result of each iteration of the algorithm, and so it gets
pushed out of the registers into memory at the end of each iteration and
then read back in at the beginning of the next iteration.  With a 32-bit
CRC, the entire calculation could potentially be done in the registers,
with the final result being written to memory at the end.  Combined with
the fact that fewer cycles will be required for the shift (and that main
memory is only read rather than written), I would expect a 32-bit CRC to
be significantly faster.

I also think that since the last tests for 32-bit vs 64-bit CRC were done,
compilers will have improved by several orders of magnitude, making the
difference between the two more noticeable.  However, I shall wait until I
have the code completed and working before I report on the results :)

Kind regards,

Mark.

------------------------
WebBased Ltd
17 Research Way
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
On Wed, 18 May 2005 01:12:26 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Wait, par for 32-bit CRCs? Or for 64-bit CRCs calculated using 32-bit ints?
>
>Right, the latter.  We haven't actually tried to measure the cost of
>plain 32bit CRCs... although I seem to recall that when we originally
>decided to use 64bit, someone put up some benchmarks purporting to
>show that there wasn't much difference.

That someone wasn't me (I wasn't around here at that time), but I have
done a few tests today on 32 bit Intel with VC6:

Optimization |               CRC algorithms
Settings     |  32    32a   32b   2x32   64    64a   64b
-------------+-----------------------------------------------
Default      | 7.6   7.6   6.2   8.3    9.1   9.2   9.5
MinSize      | 2.96  2.97  2.97  4.76   6.00  5.98  6.31
MaxSpeed     | 2.92  2.92  2.97  3.13   6.32  6.33  6.22

32a and 32b are functionally equivalent variants of CRC32 where the crc is
a plain uint32, not a struct with just one field.  Same for 64a, 64b, and
64, respectively.

The most important figure is, that at MaxSpeed (/O2) 2x32 is almost twice
as fast as CRC32 while only being marginally slower than CRC32.

In case anybody wants to repeat my tests or find any flaw therein, the
source is attached.

Servus
Manfred
On Wed, 18 May 2005 13:50:22 +0200, I wrote:
>The most important figure is, that at MaxSpeed (/O2) 2x32 is almost
>twice as fast as CRC32 while only being marginally slower than CRC32.
                  ^^^^^
Silly typo!  That should have been:

The most important figure is, that at MaxSpeed (/O2) 2x32 is almost twice
as fast as CRC64 while only being marginally slower than CRC32.

Servus
Manfred
> -----Original Message-----
> From: Manfred Koizar [mailto:mkoi-pg@aon.at]
> Sent: 25 May 2005 20:25
> To: Manfred Koizar
> Cc: Tom Lane; Greg Stark; Bruce Momjian; Mark Cave-Ayland
> (External); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> The most important figure is, that at MaxSpeed (/O2) 2x32 is
> almost twice as fast as CRC64 while only being marginally
> slower than CRC32.
>
> Servus
> Manfred

Hi Manfred,

Sorry about taking a while to respond on this one - the hard drive on my
laptop crashed :(. I repeated your tests on my P4 laptop with gcc 3.2.3
and reproduced the results below:

Opt    32     32a    32b    2x32    64     64a    64b
--------------------------------------------------------
O1     4.91   4.86   5.43   6.00    11.4   11.39  11.39
O2     4.96   4.94   4.69   5.18    15.86  18.75  24.73
O3     4.82   4.83   4.64   5.18    15.14  13.77  14.73
                                    ^^^^^^^^^^^^

So in summary I would say:

- Calculating a CRC64 using 2 x 32 int can be 3 times as fast as using
  1 x 64 int on my 32-bit Intel laptop with gcc.
- The time difference between CRC32 and CRC64 is about 0.5s in the worst
  case shown during testing, so staying with CRC64 would not inflict too
  great a penalty.

Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > Opt 32 32a 32b 2x32 64 64a 64b > -------------------------------------------------------- > O1 4.91 4.86 5.43 6.00 11.4 11.39 11.39 > O2 4.96 4.94 4.69 5.18 15.86 18.75 24.73 > O3 4.82 4.83 4.64 5.18 15.14 13.77 14.73 Not sure I believe these numbers. Shouldn't 2x32 be about twice as slow as just one 32-bit CRC? regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 27 May 2005 15:00
> To: Mark Cave-Ayland (External)
> Cc: 'Manfred Koizar'; 'Greg Stark'; 'Bruce Momjian';
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > Opt    32     32a    32b    2x32    64     64a    64b
> > --------------------------------------------------------
> > O1     4.91   4.86   5.43   6.00    11.4   11.39  11.39
> > O2     4.96   4.94   4.69   5.18    15.86  18.75  24.73
> > O3     4.82   4.83   4.64   5.18    15.14  13.77  14.73
>
> Not sure I believe these numbers.  Shouldn't 2x32 be about
> twice as slow as just one 32-bit CRC?

Well it surprised me, although Manfred's results with VC6 on /MaxSpeed
show a similar margin.  The real killer has to be that I wrote a CRC32
routine in x86 inline assembler (which in comparison to the gcc-produced
version stores the CRC for each iteration in registers instead of in
memory as part of the current frame) which comes in at 6.5s....

Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 27 May 2005 15:00
> To: Mark Cave-Ayland (External)
> Cc: 'Manfred Koizar'; 'Greg Stark'; 'Bruce Momjian';
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> Not sure I believe these numbers.  Shouldn't 2x32 be about twice as
> slow as just one 32-bit CRC?

Also I've just quickly tested on the Xeon Linux FC1 box I used with my
original program using Manfred's program and the margin is even closer:

Opt    32     32a    32b    2x32   64     64a    64b
------------------------------------------------------
O1     2.75   2.81   2.71   3.16   3.53   3.64   7.25
O2     2.75   2.78   2.87   2.94   7.63   10.61  11.93
O3     2.84   2.85   3.03   2.99   7.63   7.64   7.71

I don't know whether gcc is just producing an inefficient CRC32 compared
to 2x32 but the results seem very odd....  There must be something else we
are missing?

Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > I don't know whether gcc is just producing an inefficient CRC32 compared to > 2x32 but the results seem very odd.... There must be something else we are > missing? I went back and looked at the code, and see that I was misled by terminology: what we've been calling "2x32" in this thread is not two independent CRC32 calculations, it is use of 32-bit arithmetic to execute one CRC64 calculation. The inner loop looks like while (__len-- > 0) { int __tab_index = ((int) (__crc1 >> 24) ^ *__data++) & 0xFF; __crc1 = crc_table1[__tab_index] ^ ((__crc1 << 8) | (__crc0 >> 24)); __crc0 = crc_table0[__tab_index] ^ (__crc0<< 8); } whereas a plain CRC32 looks like while (__len-- > 0) { int __tab_index = ((int) (crc >> 24) ^ *__data++) & 0xFF; crc = crc_table[__tab_index] ^ (crc << 8); } where the crc variables are uint32 in both cases. (The true 64-bit calculation looks like the latter, except that the crc variable is uint64, as is the crc_table, and the >> 24 becomes >> 56. The "2x32" code is an exact emulation of the true 64-bit code, with __crc1 and __crc0 holding the high and low halves of the 64-bit crc.) In my tests the second loop is about 10% faster than the first on an Intel machine, and maybe 20% faster on HPPA. So evidently the bulk of the cost is in the __tab_index calculation, and not so much in the table fetches. This is still a bit surprising, but it's not totally silly. Based on the numbers we've seen so far, one could argue for staying with the 64-bit CRC, but changing the rule we use for selecting which implementation code to use: use the true 64-bit code only when sizeof(unsigned long) == 64, and otherwise use the 2x32 code, even if there is a 64-bit unsigned long long type available. This essentially assumes that the unsigned long long type isn't very efficient, which isn't too unreasonable. This would buy most of the speedup without giving up anything at all in the error-detection department. Alternatively, we might say that 64-bit CRC was overkill from day one, and we'd rather get the additional 10% or 20% or so speedup. I'm kinda leaning in that direction, but only weakly. Comments? regards, tom lane
Tom Lane wrote:
> Alternatively, we might say that 64-bit CRC was overkill from day one,
> and we'd rather get the additional 10% or 20% or so speedup.  I'm kinda
> leaning in that direction, but only weakly.

Yes, I lean in that direction too since the CRC calculation is showing up
in our profiling.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 27 May 2005 17:49
> To: Mark Cave-Ayland (External)
> Cc: 'Manfred Koizar'; 'Greg Stark'; 'Bruce Momjian';
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> I went back and looked at the code, and see that I was misled by
> terminology: what we've been calling "2x32" in this thread is
> not two independent CRC32 calculations, it is use of 32-bit
> arithmetic to execute one CRC64 calculation.

Yeah, I did find the terminology a little confusing until I looked at the
source itself.  It doesn't make much sense publishing numbers if you don't
know their meaning ;)

> Based on the numbers we've seen so far, one could argue for
> staying with the 64-bit CRC, but changing the rule we use for
> selecting which implementation code to use: use the true
> 64-bit code only when sizeof(unsigned long) == 64, and
> otherwise use the 2x32 code, even if there is a 64-bit
> unsigned long long type available.  This essentially assumes
> that the unsigned long long type isn't very efficient, which
> isn't too unreasonable.  This would buy most of the speedup
> without giving up anything at all in the error-detection department.

All our servers are x86 based Linux with gcc, so if a factor of 2 speedup
for CPU calculations is the minimum improvement that we get as a result of
this thread then I would be very happy.

> Alternatively, we might say that 64-bit CRC was overkill from
> day one, and we'd rather get the additional 10% or 20% or so
> speedup.  I'm kinda leaning in that direction, but only weakly.

What would you need to persuade you either way?  I believe that disk
drives use CRCs internally to verify that the data has been read correctly
from each sector.  If the majority of the errors would be from a disk
failure, then a corrupt sector would have to pass the drive CRC *and* the
PostgreSQL CRC in order for an XLog entry to be considered valid.  I would
have thought the chances of this being able to happen would be reasonably
small and so even with CRC32 this can be detected fairly accurately.

In the case of an OS crash then we could argue that there may be a
partially written sector to the disk, in which case again either one or
both of the drive CRC and the PostgreSQL CRC would be incorrect and so
this condition could also be reasonably detected using CRC32.

As far as I can tell, the main impact of this would be that we would
reduce the ability to accurately detect multiple random bit errors, which
is more the type of error I would expect to occur in RAM (alpha particles
etc.).  How often would this be likely to occur?  I believe that different
generator polynomials have different characteristics that can make them
more sensitive to a particular type of error.  Perhaps Manfred can tell us
the generator polynomial that was used to create the lookup tables?

Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: >> Alternatively, we might say that 64-bit CRC was overkill from >> day one, and we'd rather get the additional 10% or 20% or so >> speedup. I'm kinda leaning in that direction, but only weakly. > What would you need to persuade you either way? I believe that disk drives > use CRCs internally to verify that the data has been read correctly from > each sector. If the majority of the errors would be from a disk failure, > then a corrupt sector would have to pass the drive CRC *and* the PostgreSQL > CRC in order for an XLog entry to be considered valid. I would have thought > the chances of this being able to happen would be reasonably small and so > even with CRC32 this can be detected fairly accurately. It's not really a matter of backstopping the hardware's error detection; if we were trying to do that, we'd keep a CRC for every data page, which we don't. The real reason for the WAL CRCs is as a reliable method of identifying the end of WAL: when the "next record" doesn't checksum you know it's bogus. This is a nontrivial point because of the way that we re-use WAL files --- the pages beyond the last successfully written page aren't going to be zeroes, they'll be filled with random WAL data. Personally I think CRC32 is plenty for this job, but there were those arguing loudly for CRC64 back when we made the decision originally ... regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> It's not really a matter of backstopping the hardware's error detection;
> if we were trying to do that, we'd keep a CRC for every data page, which
> we don't.  The real reason for the WAL CRCs is as a reliable method of
> identifying the end of WAL: when the "next record" doesn't checksum you
> know it's bogus.  This is a nontrivial point because of the way that we
> re-use WAL files --- the pages beyond the last successfully written page
> aren't going to be zeroes, they'll be filled with random WAL data.

Is the random WAL data really the concern?  It seems like a more reliable
way of dealing with that would be to just accompany every WAL entry with a
sequential id and stop when the next id isn't the correct one.

I thought the problem was that if the machine crashed in the middle of
writing a WAL entry you wanted to be sure to detect that.  And there's no
guarantee the fsync will write out the WAL entry in order.  So it's
possible the end (and beginning) of the WAL entry will be there but the
middle still have been unwritten.  The only truly reliable way to handle
this would require two fsyncs per transaction commit which would be really
unfortunate.

> Personally I think CRC32 is plenty for this job, but there were those
> arguing loudly for CRC64 back when we made the decision originally ...

So given the frequency of database crashes and WAL replays, if having one
failed replay every few million years is acceptable I think 32 bits is
more than enough.  Frankly I think 16 bits would be enough.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> It's not really a matter of backstopping the hardware's error detection;
>> if we were trying to do that, we'd keep a CRC for every data page, which
>> we don't.  The real reason for the WAL CRCs is as a reliable method of
>> identifying the end of WAL: when the "next record" doesn't checksum you
>> know it's bogus.

> Is the random WAL data really the concern? It seems like a more reliable way
> of dealing with that would be to just accompany every WAL entry with a
> sequential id and stop when the next id isn't the correct one.

We do that, too (the xl_prev links and page header addresses serve that
purpose).  But it's not sufficient given that WAL records can span pages
and therefore may be incompletely written.

> The only truly reliable way to handle this would require two fsyncs per
> transaction commit which would be really unfortunate.

How are two fsyncs going to be better than one?

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Is the random WAL data really the concern? It seems like a more reliable way
> > of dealing with that would be to just accompany every WAL entry with a
> > sequential id and stop when the next id isn't the correct one.
>
> We do that, too (the xl_prev links and page header addresses serve that
> purpose).  But it's not sufficient given that WAL records can span pages
> and therefore may be incompletely written.

Right, so the problem isn't that there may be stale data that would be
unrecognizable from real data.  The problem is that the real data may be
partially there but not all there.

> > The only truly reliable way to handle this would require two fsyncs per
> > transaction commit which would be really unfortunate.
>
> How are two fsyncs going to be better than one?

Well, you fsync the WAL entry and only when that's complete do you flip a
bit marking the WAL entry as committed and fsync again.  Hm, you might
need three fsyncs, one to make sure the bit isn't set before writing out
the WAL record itself.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> Is the random WAL data really the concern? It seems like a more reliable way
>>> of dealing with that would be to just accompany every WAL entry with a
>>> sequential id and stop when the next id isn't the correct one.
>>
>> We do that, too (the xl_prev links and page header addresses serve that
>> purpose).  But it's not sufficient given that WAL records can span pages
>> and therefore may be incompletely written.

Actually, on reviewing the code I notice two missed bets here.

1. During WAL replay, we aren't actually verifying that xl_prev matches
the address of the prior WAL record.  This means we are depending only on
the page header addresses to make sure we don't replay stale WAL data left
over from the previous cycle of use of the physical WAL file.  This is
fairly dangerous, considering the likelihood of partial write of a WAL
page during a power failure: the first 512-byte sector(s) of a page may
have been updated but not the rest.  If an old WAL record happens to start
at exactly the sector boundary then we lose.

2. We store a separate CRC for each backup block attached to a WAL record.
Therefore the same torn-page problem could hit us if a stale backup block
starts exactly at an intrapage sector boundary --- there is nothing
guaranteeing that the backup block really goes with the WAL record.

#1 seems like a pretty critical, but also easily fixed, bug.  To fix #2
I suppose we'd have to modify the WAL format to store just one CRC
covering the whole of a WAL record and its attached backup block(s).

I think the reasoning behind the separate CRCs was to put a limit on the
amount of data guarded by one CRC, and thereby minimize the risk of
undetected errors.  But using the CRCs in this way is failing to guard
against exactly the problem that we want the CRCs to guard against in the
first place, namely torn WAL records ... so worrying about detection
reliability seems misplaced.

The odds of an actual failure from case #2 are fortunately not high, since
a backup block will necessarily span across at least one WAL page boundary
and so we should be able to detect stale data by noting that the next
page's header address is wrong.  (If it's not wrong, then at least the
first sector of the next page is up-to-date, so if there is any tearing
the CRC should tell us.)  Therefore I don't feel any need to try to work
out a back-patchable solution for #2.  But I do think we ought to change
the WAL format going forward to compute just one CRC across a WAL record
and all attached backup blocks.  There was talk of allowing compression of
backup blocks, and if we do that we could no longer feel any certainty
that a page crossing would occur.

Thoughts?

			regards, tom lane
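[A rough sketch of the check that point #1 calls for.  The struct and
function names below are simplified illustrations, not the actual xlog.c
code or data structures:]

    /*
     * Illustration only: during replay, remember where the previous
     * record started and refuse to replay a record whose back-link does
     * not point at it, in addition to the existing page-header checks.
     * A mismatch indicates stale data from a recycled WAL segment (or a
     * torn write), so replay should stop at the previous record.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct
    {
        uint32_t    xlogid;     /* WAL file id */
        uint32_t    xrecoff;    /* byte offset within the file */
    } RecPtrSketch;

    static bool
    prev_link_is_valid(RecPtrSketch xl_prev, RecPtrSketch prior_record)
    {
        return xl_prev.xlogid == prior_record.xlogid &&
               xl_prev.xrecoff == prior_record.xrecoff;
    }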
On Tue, 31 May 2005 12:07:53 +0100, "Mark Cave-Ayland"
<m.cave-ayland@webbased.co.uk> wrote:
>Perhaps Manfred can tell us the generator
>polynomial that was used to create the lookup tables?

  X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 + X^8 + X^7 +
  X^5 + X^4 + X^2 + X^1 + 1

-> http://www.opengroup.org/onlinepubs/009695399/utilities/cksum.html

Or google for "04c11db7".

Servus
Manfred
On Tue, 2005-05-31 at 12:27 -0400, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >>> Is the random WAL data really the concern? It seems like a more reliable way
> >>> of dealing with that would be to just accompany every WAL entry with a
> >>> sequential id and stop when the next id isn't the correct one.
> >>
> >> We do that, too (the xl_prev links and page header addresses serve that
> >> purpose).  But it's not sufficient given that WAL records can span pages
> >> and therefore may be incompletely written.
>
> Actually, on reviewing the code I notice two missed bets here.
>
> 1. During WAL replay, we aren't actually verifying that xl_prev matches
> the address of the prior WAL record.  This means we are depending only
> on the page header addresses to make sure we don't replay stale WAL data
> left over from the previous cycle of use of the physical WAL file.  This
> is fairly dangerous, considering the likelihood of partial write of a
> WAL page during a power failure: the first 512-byte sector(s) of a page
> may have been updated but not the rest.  If an old WAL record happens to
> start at exactly the sector boundary then we lose.

Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't
used, but passed that by. Damn.

> 2. We store a separate CRC for each backup block attached to a WAL
> record.  Therefore the same torn-page problem could hit us if a stale
> backup block starts exactly at an intrapage sector boundary --- there is
> nothing guaranteeing that the backup block really goes with the WAL
> record.
>
> #1 seems like a pretty critical, but also easily fixed, bug.  To fix #2
> I suppose we'd have to modify the WAL format to store just one CRC
> covering the whole of a WAL record and its attached backup block(s).
>
> I think the reasoning behind the separate CRCs was to put a limit on
> the amount of data guarded by one CRC, and thereby minimize the risk
> of undetected errors.  But using the CRCs in this way is failing to
> guard against exactly the problem that we want the CRCs to guard against
> in the first place, namely torn WAL records ... so worrying about
> detection reliability seems misplaced.
>
> The odds of an actual failure from case #2 are fortunately not high,
> since a backup block will necessarily span across at least one WAL page
> boundary and so we should be able to detect stale data by noting that
> the next page's header address is wrong.  (If it's not wrong, then at
> least the first sector of the next page is up-to-date, so if there is
> any tearing the CRC should tell us.)  Therefore I don't feel any need
> to try to work out a back-patchable solution for #2.  But I do think we
> ought to change the WAL format going forward to compute just one CRC
> across a WAL record and all attached backup blocks.  There was talk of
> allowing compression of backup blocks, and if we do that we could no
> longer feel any certainty that a page crossing would occur.
>
> Thoughts?

PreAllocXLog was already a reason to have somebody prepare new xlog files
ahead of them being used.  Surely the right solution here is to have that
agent prepare fresh/zeroed files prior to them being required.  That way
no stale data can ever occur and both of these bugs go away....

Fixing that can be backpatched so that the backend that switches files can
do the work, rather than bgwriter [ or ?].

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't
> used, but passed that by. Damn.

I couldn't believe it'd been overlooked this long, either.  It's the sort
of thing that you assume got done the first time :-(

> PreAllocXLog was already a reason to have somebody prepare new xlog
> files ahead of them being used. Surely the right solution here is to
> have that agent prepare fresh/zeroed files prior to them being required.

Uh, why?  That doubles the amount of physical I/O required to maintain the
WAL, and AFAICS it doesn't really add any safety that we can't get in a
more intelligent fashion.

			regards, tom lane
On Tue, 2005-05-31 at 22:36 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't
> > used, but passed that by. Damn.
>
> I couldn't believe it'd been overlooked this long, either.  It's the
> sort of thing that you assume got done the first time :-(

Guess it shows how infrequently PostgreSQL crashes and recovers.

> > PreAllocXLog was already a reason to have somebody prepare new xlog
> > files ahead of them being used. Surely the right solution here is to
> > have that agent prepare fresh/zeroed files prior to them being required.
>
> Uh, why?  That doubles the amount of physical I/O required to maintain
> the WAL, and AFAICS it doesn't really add any safety that we can't get
> in a more intelligent fashion.

OK, I agree that the xl_prev linkage is the more intelligent way to go.

If I/O is a problem, then surely you will agree that PreAllocXLog is still
required and should not be run by a backend?  That's going to show as a
big response time spike for that user.

That's the last bastion - the other changes are gonna smooth response
times right down, so can we do something with PreAllocXLog too?

Best Regards, Simon Riggs
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 31 May 2005 17:27
> To: Greg Stark
> Cc: Mark Cave-Ayland (External); 'Manfred Koizar'; 'Bruce
> Momjian'; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> The odds of an actual failure from case #2 are fortunately
> not high, since a backup block will necessarily span across
> at least one WAL page boundary and so we should be able to
> detect stale data by noting that the next page's header
> address is wrong.  (If it's not wrong, then at least the
> first sector of the next page is up-to-date, so if there is
> any tearing the CRC should tell us.)  Therefore I don't feel
> any need to try to work out a back-patchable solution for #2.
> But I do think we ought to change the WAL format going
> forward to compute just one CRC across a WAL record and all
> attached backup blocks.  There was talk of allowing
> compression of backup blocks, and if we do that we could no
> longer feel any certainty that a page crossing would occur.

I must admit I didn't realise that an XLog record consisted of a number of
backup blocks which were also separately CRCd.  I've been through the
source, and while the XLog code is reasonably well commented, I couldn't
find a README in the transam/ directory that explained the thinking behind
the current implementation - is this something that was discussed on the
mailing lists way back in the mists of time?

I'm still a little nervous about dropping down to CRC32 from CRC64 and so
was just wondering what the net saving would be using one CRC64 across the
whole WAL record?  For example, if an insert or an update uses 3 backup
blocks then this one change alone would immediately reduce the CPU usage
to one third of its original value?  (Something tells me that this is
probably not the case, as I imagine you would have picked this up a while
back.)

In my view, having a longer CRC is like buying a holiday with insurance -
you pay the extra cost knowing that should anything happen then you have
something to fall back on.  However, the hard part here is determining a
reasonable balance between the cost and the risk.

Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
Simon Riggs <simon@2ndquadrant.com> writes:
> If I/O is a problem, then surely you will agree that PreAllocXLog is
> still required and should not be run by a backend?

It is still required, but it isn't run by backends --- it's fired off
during checkpoints.  I think there was some discussion recently about
making it more aggressive about allocating future segments; which strikes
me as a good idea.

			regards, tom lane
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > I'm still a little nervous about dropping down to CRC32 from CRC64 and so > was just wondering what the net saving would be using one CRC64 across the > whole WAL record? None to speak of; the startup/teardown time is trivial. It's the per-byte cost that hurts. regards, tom lane