Thread: Re: New CRC algorithm: Slicing by 8
Resending... The previous three attempts haven't succeeded, probably because of size restrictions, so I am sending it in two parts.
--
gurjeet@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
On 10/18/06, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:
Hi all,
Please refer to the following link for a new algorithm to calculate CRC, developed by Intel.
http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
Please find attached the patch (pg_crc_sb8_4.diff.gz) that implements this algorithm (originally obtained from http://sourceforge.net/projects/slicing-by-8).
I tested it using pgbench, and found an average improvement of about 15 tps (291 vs. 307 tps) when running with a scaling factor of 10, 10 clients, and 1000 transactions per client (I don't remember these pgbench params exactly; it's been quite a while now).
The author of the algorithm says it works (theoretically) three times faster than conventional CRC calculation methods, but the tps improvement didn't reflect that; so, to demonstrate the raw speed, I extracted the source files (both PG and SB8) and made a separate project (test.tar.gz).
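For illustration, here is a sketch of the difference between the two inner loops (not the actual PG or Intel sources; the lookup tables are assumed to be precomputed elsewhere, and the index conventions are those of common reflected CRC-32 implementations):

    #include <stdint.h>
    #include <stddef.h>

    extern const uint32_t crc_table[256];     /* conventional: one 1 KB table */
    extern const uint32_t sb8_table[8][256];  /* slicing-by-8: eight tables, 8 KB */

    /* Conventional algorithm: one dependent table lookup per input byte. */
    uint32_t
    crc32_bytewise(uint32_t crc, const unsigned char *p, size_t len)
    {
        while (len-- > 0)
            crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
        return crc;
    }

    /*
     * Slicing-by-8: consume 8 bytes per iteration; the eight lookups are
     * mutually independent, so the CPU can overlap them instead of waiting
     * for each one before starting the next.
     */
    uint32_t
    crc32_sb8(uint32_t crc, const unsigned char *p, size_t len)
    {
        while (len >= 8)
        {
            uint32_t lo = crc ^ (p[0] | p[1] << 8 | p[2] << 16 | (uint32_t) p[3] << 24);
            uint32_t hi = p[4] | p[5] << 8 | p[6] << 16 | (uint32_t) p[7] << 24;

            crc = sb8_table[7][lo & 0xFF] ^ sb8_table[6][(lo >> 8) & 0xFF] ^
                  sb8_table[5][(lo >> 16) & 0xFF] ^ sb8_table[4][lo >> 24] ^
                  sb8_table[3][hi & 0xFF] ^ sb8_table[2][(hi >> 8) & 0xFF] ^
                  sb8_table[1][(hi >> 16) & 0xFF] ^ sb8_table[0][hi >> 24];
            p += 8;
            len -= 8;
        }
        while (len-- > 0)    /* tail bytes fall back to the byte-wise loop */
            crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
        return crc;
    }

The theoretical win comes from that instruction-level parallelism; whether it materializes depends heavily on the CPU and compiler, which is what the tests below explore.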
The resulting binary's command format is:
<snip>
<prog_name> <algo_name> [ <scaling_factor> [ <block_size> [ <list_size> ] ] ]
algo_name : PG or SB8 or BOTH (just the first char will also do!)
scaling_factor: the multiple to use when creating the list of buffer blocks.
block_size : buffer size (in KBs); default is 4.
list_size : number of data blocks to create for the test; default is 100.
So the command: a.out p 3 1024 300
will test PG's algorithm over 900 blocks of 1 MB each.
</snip>
The test.sh script is a wrapper around this binary; it tests PG and SB8 individually, and both together, three times each. I performed two tests using this script to demonstrate SB8's effectiveness on large and small data blocks:
sh test.sh 1 1024 800 -- 800 blocks of 1 MB each; effective data size 800 MB
sh test.sh 800 1 1024 -- 800 * 1024 blocks of 1 KB each; effective data size 800 MB
The new algorithm shone in both test configurations, performing 3 to 4 times better than the regular CRC algorithm in both types of tests. The results are in the files results_1_1024_800.out and results_800_1_1024.out, respectively.
To test it yourself, extract all the files from test.tar.gz, and issue the following commands (assuming bash):
sh make.sh
sh test.sh 1 1024 800
sh test.sh 800 1 1024
(Note: you may want to reduce the effective data size on a Windows box; I used 500 MB on Windows+MinGW. I chose these parameters because I have only 1 GB of RAM, and increasing the data size beyond these numbers caused the memory contents to spill into swap-space/pagefile, and that disk I/O severely distorts the results.)
The tests were performed on Fedora Core 5 and MinGW on WinXP Professional.
If possible, people should test it on different platforms, so as to ensure that it doesn't perform any worse than the older implementation on any supported platform (please post the results if you do test it). Also, recommendations for a better benchmark than pgbench are welcome, so that we can show the improvement when it is used as part of PG.
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Resending... The previous three attempts haven't succeeded, probably because of size restrictions, so I am sending it in two parts.
This is part 2; unfortunately, the size of the gzipped file is still a bit big (26 KB), so I have uploaded it to
http://www.geocities.com/gurjeet79/crc_sb8_test.tar.gz
Please download and test it from there.
Can someone let us all know what is the limit on the attachments sent to the -hackers list?
Best regards,
--
gurjeet@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Gurjeet Singh wrote:
> Can someone let us all know what is the limit on the attachments
> sent to the -hackers list?

Generally it's probably best not to send attachments to -hackers at all. If you have a patch, send it to -hackers. Or you can publish it somewhere on the web and post a link to it (this is one of the things I think a developers' wiki is actually useful for).

cheers

andrew
Andrew Dunstan wrote:
> If you have a patch, send it to -hackers.

I mean -patches of course. Sorry.

cheers

andrew
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes: >> Please refer to the following link for a new algorithm to calculate >> CRC, developed by Intel. >> >> http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm I can't help wondering what the patent status of this algorithm is. (This from a guy who still remembers very well how Unisys published the LZW algorithm in a manner that made it look like it was free for anyone to use...) regards, tom lane
Tom Lane wrote:
> "Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
>>> Please refer to the following link for a new algorithm to calculate
>>> CRC, developed by Intel.
>>>
>>> http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
>
> I can't help wondering what the patent status of this algorithm is.
>
> (This from a guy who still remembers very well how Unisys published the
> LZW algorithm in a manner that made it look like it was free for anyone
> to use...)

This article

http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm

seems to imply it's provided "free". It doesn't say anything about being free of patents though.

The Sourceforge project referred to in the article (but for which no link is given) seems to be this one:

http://sourceforge.net/projects/slicing-by-8

The "files" section contains a zip archive, inside which there are three source files all of which state that the package, which is (c) Intel Corp. 2004-2006, is released under the BSD license, with this URL:

http://www.opensource.org/licenses/bsd-license.html

Again, no mention of patents anywhere.

--
Alvaro Herrera   http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 10/20/06, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> It doesn't say anything about
> being free of patents though.
>
> The Sourceforge project referred to in the article (but for which no
> link is given) seems to be this one:
>
> http://sourceforge.net/projects/slicing-by-8

Yes. I had already mentioned that link in my posts.

> The "files" section contains a zip archive, inside which there are three
> source files all of which state that the package, which is (c) Intel
> Corp. 2004-2006, is released under the BSD license, with this URL:
>
> http://www.opensource.org/licenses/bsd-license.html
>
> Again, no mention of patents anywhere.

I'll try to get in touch with the author and get the clarification.

Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Gurjeet Singh wrote:
> On 10/20/06, Alvaro Herrera <alvherre@commandprompt.com> wrote:
>> It doesn't say anything about
>> being free of patents though.
>>
>> The Sourceforge project referred to in the article (but for which no
>> link is given) seems to be this one:
>>
>> http://sourceforge.net/projects/slicing-by-8
>
> Yes. I had already mentioned that link in my posts.
>
>> The "files" section contains a zip archive, inside which there are three
>> source files all of which state that the package, which is (c) Intel
>> Corp. 2004-2006, is released under the BSD license, with this URL:
>>
>> http://www.opensource.org/licenses/bsd-license.html
>>
>> Again, no mention of patents anywhere.
>
> I'll try to get in touch with the author and get the clarification.

Yep, the license and patent status seem to be separate issues. However, I am not sure getting a clarification from the author even helps us legally. Also, why are we more critical of an Intel-provided idea than any other idea we get from the community? The scary answer is that in many ways they are the same, and have the same risks. So unless we hear about a problem, I think we should use the code.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> However, I am not sure getting a clarification from the author even
> helps us legally. Also, why are we more critical of an Intel-provided
> idea than any other idea we get from the community?

Bitter experience with other companies.

> So unless we hear about a problem, I think we should use the code.

It hasn't even been tested. One thing I'd want to know about is the performance effect on non-Intel machines.

            regards, tom lane
On 10/21/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[snip]
> It hasn't even been tested. One thing I'd want to know about is the
> performance effect on non-Intel machines.

On Opteron 265 his test code shows SB8 (the Intel algorithm) is 2.48x faster for checksum and 1.95x faster for verify for the 800 * 1024 blocks of 1 KB each workload. For 100000 blocks of 8k I got similar results as well. It looks like the new code may have a larger cache footprint, so actual performance may differ from the microbenchmark.
On 10/22/06, Gregory Maxwell <gmaxwell@gmail.com> wrote:
> On Opteron 265 his test code shows SB8 (the Intel algorithm) is 2.48x
> faster for checksum and 1.95x faster for verify for the 800 * 1024
> blocks of 1 KB each workload. For 100000 blocks of 8k I got similar
> results as well.

I think you meant 800 * 1024 blocks of 1 >MB< each workload.

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes: >> If possible, people should test it on different platforms, so as to >> ensure that it doesn't perform any worse than older implementation on any >> supported platform (please post the results, if you do test it). I didn't particularly trust the timing calculations in your benchmark program, so I made up my own low-tech test version (attached). I get the following timings for 1 million iterations of INIT_CRC32/COMP_CRC32/FIN_CRC32 on different buffer sizes, using "gcc -O2" on some machines laying about the house: Std CRC Slice8 CRC HPPA (HPUX 10.20) 8192 bytes 44.897212 35.191499 1024 bytes 5.639081 4.669850 64 bytes 0.377416 0.613195 PPC (Mac G4, Darwin 10.4.8) 8192 bytes 12.663135 12.158293 1024 bytes 1.579940 1.614967 64 bytes 0.104310 0.229401 Intel Xeon EM64T (Fedora Core 5) 8192 bytes 4.420879 7.633120 1024 bytes 0.571794 0.819372 64 bytes 0.047354 0.071906 Intel Pentium 4 (Fedora Core 5) 8192 bytes 6.942324 28.848572 (yes, really) 1024 bytes 0.905259 3.625360 64 bytes 0.068065 0.260224 It's worth pointing out that this test program is biased in favor of the slice8 code about as far as it could possibly be: after the first iteration, the remaining 999999 will find the cache as hot as possible. Furthermore, the test doesn't exercise misaligned or odd-length cases, which would incur extra start/stop overhead for slice8. These numbers are um, not impressive. Considering that a large fraction of our WAL records are pretty short, the fact that slice8 consistently loses at short buffer lengths is especially discouraging. Much of that ground could be made up perhaps with tenser coding of the initialization and finalization code, but it'd still not be worth taking any legal risk for AFAICS. regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> However, I am not sure getting a clarification from the author even
>> helps us legally. Also, why are we more critical of an Intel-provided
>> idea than any other idea we get from the community?
>
> Bitter experience with other companies.

The problem is we have lots of companies involved, and I bet some we don't even know about (e.g. yahoo/gmail addresses), and with contributors who don't know that their employment agreement says anything they do while employed is company property. And we have Unisys involved now too. How worrying is that? :-(

>> So unless we hear about a problem, I think we should use the code.
>
> It hasn't even been tested. One thing I'd want to know about is the
> performance effect on non-Intel machines.

Sure.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> Also, why are we more critical of an Intel-provided
>>> idea than any other idea we get from the community?
>>
>> Bitter experience with other companies.
>
> The problem is we have lots of companies involved, and I bet some we
> don't even know about (e.g. yahoo/gmail addresses),

It's not so much that I don't trust Intel as that a CRC algorithm is exactly the sort of nice little self-contained thing that people love to try to patent these days. What I am really afraid of is that someone else has already invented this same method (or something close enough to it) and filed for a patent that Intel doesn't know about either. I'd be wondering about that no matter where the code had come from.

Given the numbers I posted earlier today, the proposal is dead in the water anyway, quite aside from any legal considerations.

            regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> Tom Lane wrote:
>>> Bruce Momjian <bruce@momjian.us> writes:
>>>> Also, why are we more critical of an Intel-provided
>>>> idea than any other idea we get from the community?
>>>
>>> Bitter experience with other companies.
>>
>> The problem is we have lots of companies involved, and I bet some we
>> don't even know about (e.g. yahoo/gmail addresses),
>
> It's not so much that I don't trust Intel as that a CRC algorithm is
> exactly the sort of nice little self-contained thing that people love
> to try to patent these days. What I am really afraid of is that someone
> else has already invented this same method (or something close enough
> to it) and filed for a patent that Intel doesn't know about either.
> I'd be wondering about that no matter where the code had come from.
>
> Given the numbers I posted earlier today, the proposal is dead in the
> water anyway, quite aside from any legal considerations.

Agreed. I just wanted to point out we have other sharks in the water. :-(

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On Sun, Oct 22, 2006 at 06:06:13PM -0400, Tom Lane wrote:
> Intel Xeon EM64T (Fedora Core 5)
> 8192 bytes      4.420879     7.633120
> 1024 bytes      0.571794     0.819372
> 64 bytes        0.047354     0.071906
>
> Intel Pentium 4 (Fedora Core 5)
> 8192 bytes      6.942324    28.848572  (yes, really)
> 1024 bytes      0.905259     3.625360
> 64 bytes        0.068065     0.260224

AMD X2 3800+ (Fedora Core 5)

                Std CRC      Slice8 CRC
8192 bytes      8.576635     7.170038
1024 bytes      1.504361     1.402446
64 bytes        0.154459     0.144209

Odd that the AMD shows the opposite of the two Intel numbers above, and that it was an "Intel engineer" who wrote it. My first speculation is that you did your Intel numbers backwards. My second speculation is that you already thought of that and confirmed before posting. :-)

So yeah - not too impressive...

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Ottawa, Ontario, Canada
http://mark.mielke.cc/
mark@mark.mielke.cc writes:
> ... My first speculation is
> that you did your Intel numbers backwards. My second speculation is
> that you already thought of that and confirmed before posting. :-)

Yah, I checked. Several times... but if anyone else wants to repeat the experiment, please do. Or look for bugs in either my test case or Gurjeet's.

            regards, tom lane
Tom Lane wrote:
> Yah, I checked. Several times... but if anyone else wants to repeat
> the experiment, please do. Or look for bugs in either my test case
> or Gurjeet's.

FWIW - FreeBSD and Linux results using Tom's test program on almost identical hardware[1]:

                Std crc      Slice-8 crc

Intel P-III 1.26Ghz (FreeBSD 6.2)
8192 bytes     12.975314    14.503810
1024 bytes      1.633557     1.852322
64 bytes        0.111580     0.206975

Intel P-III 1.26Ghz (Gentoo 2006.1)
8192 bytes     12.967997    28.363876
1024 bytes      1.632317     3.626230
64 bytes        0.111513     0.326557

Interesting that the slice-8 algorithm seems to work noticeably better on FreeBSD than Linux - but still not as well as the standard one (for these tests anyway)...

Cheers

Mark

[1] Both boxes have identical mobos, memory and CPUs (same sspec nos).
Mark Kirkwood <markir@paradise.net.nz> writes:
> Interesting that the slice-8 algorithm seems to work noticeably better
> on FreeBSD than Linux -

Are you running similar gcc versions on both? I realize I forgot to document what I was using:

HPPA: gcc version 2.95.3 20010315 (release)
PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)

Interesting that it's only the oldest-line tech that's showing a win for me ...

            regards, tom lane
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>> Interesting that the slice-8 algorithm seems to work noticeably better
>> on FreeBSD than Linux -
>
> Are you running similar gcc versions on both? I realize I forgot to
> document what I was using:
>
> HPPA: gcc version 2.95.3 20010315 (release)
> PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
> both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
>
> Interesting that it's only the oldest-line tech that's showing a win
> for me ...

Ah - good point, FreeBSD is using an older compiler:

FreeBSD: gcc (GCC) 3.4.6 [FreeBSD] 20060305
Linux:   gcc (GCC) 4.1.1 (Gentoo 4.1.1)

Hmm - there is a FreeBSD port for 4.1.2, I might set that off to build itself and see if compiling with it changes the results any....
On Mon, 23 Oct 2006 07:35:08 +0200, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I realize I forgot to document what I was using:
>
> HPPA: gcc version 2.95.3 20010315 (release)
> PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
> both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
>
> Interesting that it's only the oldest-line tech that's showing a win
> for me ...

Does anyone have an Intel compiler available? It would be interesting to see results on that compared to gcc (given that it is an Intel algorithm ;).

\Anders
Tom Lane wrote:
> Are you running similar gcc versions on both? I realize I forgot to
> document what I was using:
>
> HPPA: gcc version 2.95.3 20010315 (release)
> PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
> both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
>
> Interesting that it's only the oldest-line tech that's showing a win
> for me ...

It would be interesting to see if the Intel compiler works better for this algorithm. Anyone have it installed?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, 23 Oct 2006, Mark Kirkwood wrote:
> Tom Lane wrote:
>> Yah, I checked. Several times... but if anyone else wants to repeat
>> the experiment, please do. Or look for bugs in either my test case
>> or Gurjeet's.

Just for fun, I tried it out with both GCC and with Intel's C compiler with some aggressive platform-specific flags on my 2.8Ghz Xeon running Gentoo.

                Std crc      Slice-8 crc

Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)
8192 bytes      4.697572     9.806341
1024 bytes      0.597429     1.181828
64 bytes        0.046636     0.086984

Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)
8192 bytes      0.000004     0.001085
1024 bytes      0.000004     0.001292
64 bytes        0.000003     0.001078

So at this point I realize that Intel's compiler is optimizing the loop away, at least for the std crc and probably for both. So I make mycrc an array of 2, and subscript mycrc[j&1] in the loop.

                Std crc      Slice-8 crc

Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)
8192 bytes     51.397146     9.523182
1024 bytes      6.430986     1.229043
64 bytes        0.400062     0.128579

Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)
8192 bytes     29.881708     0.001432
1024 bytes      3.750313     0.001432
64 bytes        0.238583     0.001431

So it looks like something fishy is still going on with the slice-8 with the Intel compiler. I have attached my changed testcrc.c file.

> FWIW - FreeBSD and Linux results using Tom's test program on almost
> identical hardware[1]:
>
>                 Std crc      Slice-8 crc
>
> Intel P-III 1.26Ghz (FreeBSD 6.2)
> 8192 bytes     12.975314    14.503810
> 1024 bytes      1.633557     1.852322
> 64 bytes        0.111580     0.206975
>
> Intel P-III 1.26Ghz (Gentoo 2006.1)
> 8192 bytes     12.967997    28.363876
> 1024 bytes      1.632317     3.626230
> 64 bytes        0.111513     0.326557
>
> Interesting that the slice-8 algorithm seems to work noticeably better on
> FreeBSD than Linux - but still not as well as the standard one (for these
> tests anyway)...
>
> [1] Both boxes have identical mobos, memory and CPUs (same sspec nos).

--
You can tune a piano, but you can't tuna fish.
Mark Kirkwood wrote:
> Tom Lane wrote:
>> Are you running similar gcc versions on both? I realize I forgot to
>> document what I was using:
>
> Ah - good point, FreeBSD is using an older compiler:
>
> FreeBSD: gcc (GCC) 3.4.6 [FreeBSD] 20060305
> Linux:   gcc (GCC) 4.1.1 (Gentoo 4.1.1)
>
> Hmm - there is a FreeBSD port for 4.1.2, I might set that off to build
> itself and see if compiling with it changes the results any....

Here are the results after building gcc 4.1.2 (repeating results for gcc 3.4.6 for comparison). I suspect that performance is probably impacted because gcc 4.1.2 (and also the rest of the tool-chain) is built with gcc 3.4.6 - but it certainly suggests that the newer gcc versions don't like the slice-8 algorithm for some reason.

                Std crc      Slice-8 crc

Intel P-III 1.26Ghz (FreeBSD 6.2 gcc 3.4.6)
8192 bytes     12.975314    14.503810
1024 bytes      1.633557     1.852322
64 bytes        0.111580     0.206975

Intel P-III 1.26Ghz (FreeBSD 6.2 gcc 4.1.2)
8192 bytes     19.516974    29.457739
1024 bytes      2.448570     3.742106
64 bytes        0.112644     0.335292
>>>>> "MK" == Mark Kirkwood <markir@paradise.net.nz> writes:

MK> Here are the results after building gcc 4.1.2 (repeating results
MK> for gcc 3.4.6 for comparison). I suspect that performance is
MK> probably impacted because gcc 4.1.2 (and also the rest of the
MK> tool-chain) is built with gcc 3.4.6 - but it certainly suggests
MK> that the newer gcc versions don't like the slice-8 algorithm for
MK> some reason.

They don't seem to like the old CRC algorithms either. It is quite strange, such dramatic performance regressions from 3.x to 4.x are rare.

/Benny
Jeremy Drake <pgsql@jdrake.com> writes:
> So at this point I realize that Intel's compiler is optimizing the loop
> away, at least for the std crc and probably for both. So I make mycrc an
> array of 2, and subscript mycrc[j&1] in the loop.

That's not a good workaround, because making mycrc expensive to access means your inner loop timing isn't credible at all. Instead try making the buffer array nonlocal --- malloc it, perhaps.

            regards, tom lane
On Sun, 2006-10-22 at 18:06 -0400, Tom Lane wrote:
> These numbers are um, not impressive. Considering that a large fraction
> of our WAL records are pretty short, the fact that slice8 consistently
> loses at short buffer lengths is especially discouraging. Much of that
> ground could be made up perhaps with tenser coding of the initialization
> and finalization code, but it'd still not be worth taking any legal risk
> for AFAICS.

It doesn't look good for SB8, does it? Nor for gcc 4.1 either. Presumably Intel themselves will have some come-back, but I'm not sure what they'll say to so many conclusive tests.

Instead, I'd like to include a parameter to turn off CRC altogether, for heavily CPU bound operations and the WAL drive on trustworthy hardware.

wal_checksum = off

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > Instead, I'd like to include a parameter to turn off CRC altogether, for > heavily CPU bound operations and the WAL drive on trustworthy hardware. No can do --- we rely on the checksums to be able to tell when we've hit the end of WAL during replay. You may as well propose not writing WAL at all (and no, I don't think it'd pass). regards, tom lane
On 10/23/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's not so much that I don't trust Intel as that a CRC algorithm is
> exactly the sort of nice little self-contained thing that people love
> to try to patent these days. What I am really afraid of is that someone
> else has already invented this same method (or something close enough
> to it) and filed for a patent that Intel doesn't know about either.
> I'd be wondering about that no matter where the code had come from.
>
> Given the numbers I posted earlier today, the proposal is dead in the
> water anyway, quite aside from any legal considerations.

The horror, the horror.

I wonder if changing to Slicing by 8 would be self-contained enough that people from the software-patent-free world would be able to just patch their distribution if they wish.

Regards,
Dawid
Bruce, Tom, All:

>> Given the numbers I posted earlier today, the proposal is dead in the
>> water anyway, quite aside from any legal considerations.
>
> Agreed. I just wanted to point out we have other sharks in the water.

*IF* Slice-by-8 turned out to be a winner, I could get the legal issues looked into and probably resolved. But we should only do that if it turns out we really want to use it.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> Instead, I'd like to include a parameter to turn off CRC altogether, for
>> heavily CPU bound operations and the WAL drive on trustworthy hardware.
>
> No can do --- we rely on the checksums to be able to tell when we've hit
> the end of WAL during replay.

No we don't: zero-length records are the trigger for EOF. Sure, a CRC failure will cause replay to end, but only if you calculate it and compare it to what's on disk - which is the bit I would turn off. We don't rely on that, however, so it is avoidable.

In most cases, it would be foolish to avoid: but there are cases where the data is CRC checked by the hardware/system already, so I'd like to make an option to turn this off, defaulting to on, for safety.

> You may as well propose not writing WAL
> at all (and no, I don't think it'd pass).

That would undo all of my efforts, so no I wouldn't consider that.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
> In most cases, it would be foolish to avoid: but there are cases where
> the data is CRC checked by the hardware/system already, so I'd like to
> make an option to turn this off, defaulting to on, for safety.

How would we know? What are those cases? Sounds like a foot gun to me.

Sincerely,

Joshua D. Drake

>> You may as well propose not writing WAL
>> at all (and no, I don't think it'd pass).
>
> That would undo all of my efforts, so no I wouldn't consider that.

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote: >> No can do --- we rely on the checksums to be able to tell when we've hit >> the end of WAL during replay. > No we don't: Zero length records are the trigger for EOF. Only if the file happens to be all-zero already, which is not the normal operating state (see WAL-file recycling). Otherwise you have to be able to detect an invalid record. There are actually three checks used to detect end of WAL: zero record length, invalid checksum, and incorrect back-pointer. Zero length is the first and cleanest-looking test, but AFAICS we have to have both of the others to avoid obvious failure modes. regards, tom lane
On Mon, 23 Oct 2006, Tom Lane wrote:
> Jeremy Drake <pgsql@jdrake.com> writes:
>> So at this point I realize that Intel's compiler is optimizing the loop
>> away, at least for the std crc and probably for both. So I make mycrc an
>> array of 2, and subscript mycrc[j&1] in the loop.
>
> That's not a good workaround, because making mycrc expensive to access
> means your inner loop timing isn't credible at all. Instead try making the
> buffer array nonlocal --- malloc it, perhaps.

That did not make any difference. The way I see it, the only way to convince the compiler it really needs to do this loop more than once is to make it think it is not overwriting the same variable every time. The subscript was the cheapest way I could think of to do that. Any other suggestions on how to do this are welcome.

--
I like being single. I'm always there when I need me.
        -- Art Leo
Jeremy Drake <pgsql@jdrake.com> writes:
> On Mon, 23 Oct 2006, Tom Lane wrote:
>> That's not a good workaround, because making mycrc expensive to access
>> means your inner loop timing isn't credible at all. Instead try making the
>> buffer array nonlocal --- malloc it, perhaps.
>
> That did not make any difference. The way I see it, the only way to
> convince the compiler it really needs to do this loop more than once is to
> make it think it is not overwriting the same variable every time. The
> subscript was the cheapest way I could think of to do that. Any other
> suggestions on how to do this are welcome.

Hmm. Maybe store the CRCs into a global array somewhere?

    uint32 results[NTESTS];

    for ...
    {
        INIT/COMP/FIN_CRC32...
        results[j] = mycrc;
    }

This still adds a bit of overhead to the outer loop, but not much. Another possibility is to put the INIT/COMP/FIN_CRC32 into an external subroutine, thereby adding a call/return to the outer loop overhead.

            regards, tom lane
On Mon, 23 Oct 2006, Tom Lane wrote:
> Hmm. Maybe store the CRCs into a global array somewhere?
>
>     uint32 results[NTESTS];
>
>     for ...
>     {
>         INIT/COMP/FIN_CRC32...
>         results[j] = mycrc;
>     }
>
> This still adds a bit of overhead to the outer loop, but not much.

That seems to have worked.

                Std crc      Slice-8 crc

Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)
8192 bytes     26.765317    10.511143
1024 bytes      3.357843     1.280890
64 bytes        0.223213     0.103767

Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)
8192 bytes     29.495836     0.007107
1024 bytes      3.708665     0.012183
64 bytes        0.242579     0.008700

So the gcc times are reasonable, but the icc times for the slice-by-8 are still too fast to be believed. I will have to take a look at the generated assembly later and see what gives.

My changed testcrc.c is attached, again.

--
"I'd love to go out with you, but I did my own thing and now I've got to undo it."
Benny Amorsen wrote:
>>>>>> "MK" == Mark Kirkwood <markir@paradise.net.nz> writes:
>
> MK> Here are the results after building gcc 4.1.2 (repeating results
> MK> for gcc 3.4.6 for comparison). I suspect that performance is
> MK> probably impacted because gcc 4.1.2 (and also the rest of the
> MK> tool-chain) is built with gcc 3.4.6 - but it certainly suggests
> MK> that the newer gcc versions don't like the slice-8 algorithm for
> MK> some reason.
>
> They don't seem to like the old CRC algorithms either. It is quite
> strange, such dramatic performance regressions from 3.x to 4.x are
> rare.

Right - I think the regression is caused by libc and kernel being built with gcc 3.4.6 and the test program being built with gcc 4.1.2. Rebuilding *everything* with 4.1.2 (which I'm not sure is possible for FreeBSD at the moment) would probably get us back to numbers that looked more like my Gentoo ones [1].

Cheers

Mark

[1] Note that the upgrade process for switching Gentoo from gcc 3.4 to 4.1 involves precisely this - build 4.1, then rebuild everything using 4.1 (including 4.1 itself!)
Mark Kirkwood <markir@paradise.net.nz> writes:
> Right - I think the regression is caused by libc and kernel being built
> with gcc 3.4.6 and the test program being built with gcc 4.1.2.

Why do you think that? The performance of the CRC loop shouldn't depend at all on either libc or the kernel, because they're not invoked inside the loop.

            regards, tom lane
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>> Right - I think the regression is caused by libc and kernel being built
>> with gcc 3.4.6 and the test program being built with gcc 4.1.2.
>
> Why do you think that? The performance of the CRC loop shouldn't depend
> at all on either libc or the kernel, because they're not invoked inside
> the loop.

Just come to the same conclusion myself... I've got gcc 3.4.6 on Gentoo, so tried using that, and observed no regression at all there for the std case - which pretty much rules out any issues connected with "what did I build kernel + libc with vs what compiler am I using now":

Intel P-III 1.26Ghz (Gentoo 2006.1)

                Std crc      Slice-8 crc
8192 bytes     12.967997    28.363876   (gcc 4.1.1)
8192 bytes     12.956765    13.978495   (gcc 3.4.6)

So Gentoo using gcc 3.4.6 looks like FreeBSD using 3.4.6, so the std vs slice-8 performance seems to be all about compiler version. Now the regression on FreeBSD for the std case might be (as Kenneth pointed out) due to 4.1.1 being built by 3.4.6... but I guess it is just a nit at this point.

Cheers

Mark
On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
>>> No can do --- we rely on the checksums to be able to tell when we've hit
>>> the end of WAL during replay.
>>
>> No we don't: Zero length records are the trigger for EOF.
>
> Only if the file happens to be all-zero already, which is not the normal
> operating state (see WAL-file recycling). Otherwise you have to be able
> to detect an invalid record.

OK.

> There are actually three checks used to detect end of WAL: zero record
> length, invalid checksum, and incorrect back-pointer. Zero length is
> the first and cleanest-looking test, but AFAICS we have to have both of
> the others to avoid obvious failure modes.

Well, I understand the need for the zero-length test and the incorrect back pointer (which also incidentally tests that the current record was not left over from previous use of the xlog file). The checksum protects from torn pages and disk errors. If you have full_page_writes set then you already believe yourself safe from torn pages and your disks could also already be CRC-checking the data. So you don't *need* the checksum in those cases. If we really think we need it we could put the xlprev pointer as the *last* field on the xlrec, just to make doubly sure - having a trailer as well as a header.

CRC-checked disks are actually the industry norm and have been for around 5 years. ANSI SCSI Parallel Interface 3 (SPI-3) (UltraSCSI 160) defines the use of CRC and this is available as standard from all key manufacturers (CRC-32 is the required level). So if you are using modern SCSI, SATA or SAS technologies you'll be just fine. (Those checks are a *requirement* of the underlying technology because of the error rates when the bus speeds are so high.) I checked this out in a conversation with the fine people at Seagate, who also publish a variety of technical manuals/details on this, e.g.

http://www.maxtor.com/_files/maxtor/en_us/documentation/white_papers_technical/wp_ultra320.pdf

...You'll see that CRC is a Mandatory Feature of ANSI SPI-4 now.

http://www.maxtor.com/_files/maxtor/en_us/documentation/white_papers_technical/sas_link_layer.pdf

So, I'd like the *option* to turn our CRC off, please, somehow.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote: >> There are actually three checks used to detect end of WAL: zero record >> length, invalid checksum, and incorrect back-pointer. Zero length is >> the first and cleanest-looking test, but AFAICS we have to have both of >> the others to avoid obvious failure modes. > The checksum protects from torn pages and disk errors. If you have > full_page_writes set then you already believe yourself safe from torn > pages and your disks could also already be CRC-checking the data. No, because unlike tuples, WAL records can and do cross page boundaries. Unless you're prepared to posit that your system never crashes at all, you have to be able to detect the case where you've got a good front half of a WAL record and non-matching data in the next page. The two halves could be written out in widely separated write()s, indeed might never have been simultaneously resident in the WAL buffers at all. > CRC-checked disks are actually the industry norm and have been for > around 5 years. Huh? Disks have ALWAYS had CRCs, and this is in any case utterly irrelevant to the database-crash risk. regards, tom lane
On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
>>> There are actually three checks used to detect end of WAL: zero record
>>> length, invalid checksum, and incorrect back-pointer. Zero length is
>>> the first and cleanest-looking test, but AFAICS we have to have both of
>>> the others to avoid obvious failure modes.
>>
>> The checksum protects from torn pages and disk errors. If you have
>> full_page_writes set then you already believe yourself safe from torn
>> pages and your disks could also already be CRC-checking the data.
>
> No, because unlike tuples, WAL records can and do cross page boundaries.

But not that often, with full_page_writes = off. So we could get away with just CRC checking the page-spanning ones and mark the records to show whether they have been CRC checked or not and need to be rechecked at recovery time. That would reduce the CRC overhead to about 1-5% of what it is now (as an option).

Just a thought: Would there be benefit in not allowing page-spanning WAL records, when they are small? That would tend to reduce the number of WAL writes, even if it did cause some space wastage on disk. That would reduce the number of same block re-writes and might improve the sequential behaviour of WAL access. We never needed to think about this while full_page_writes=on was the only option.

>> CRC-checked disks are actually the industry norm and have been for
>> around 5 years.
>
> Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
> irrelevant to the database-crash risk.

According to the man from Seagate...

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
> On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
>> No, because unlike tuples, WAL records can and do cross page boundaries.
>
> But not that often, with full_page_writes = off. So we could get away
> with just CRC checking the page-spanning ones and mark the records to
> show whether they have been CRC checked or not and need to be rechecked
> at recovery time. That would reduce the CRC overhead to about 1-5% of
> what it is now (as an option).

WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a guarantee in here that the 8 Kbytes worth of data will be written as sequential writes, nor that the 8 Kbytes of data will necessarily finish.

If the operating system uses 8 Kbyte pages, or the RAID system uses 8 Kbytes or larger chunks, and they guarantee sequential writes, perhaps it is ok. Still, if the power goes out after writing the first 512 bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it might get better or worse, depending on the RAID configuration.

I'm almost wondering whether the three numbers are enough. I'm too busy to sketch it all down and predict failure points... :-)

> Just a thought: Would there be benefit in not allowing page-spanning WAL
> records, when they are small? That would tend to reduce the number of
> WAL writes, even if it did cause some space wastage on disk. That would
> reduce the number of same block re-writes and might improve the
> sequential behaviour of WAL access. We never needed to think about this
> while full_page_writes=on was the only option.

Probably. Might not be much though.

>>> CRC-checked disks are actually the industry norm and have been for
>>> around 5 years.
>>
>> Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
>> irrelevant to the database-crash risk.
>
> According to the man from Seagate...

Once upon a time when bits were stored the size of smarties (exaggeration).... Back in those days (before my time), they liked to use things like parity. I didn't read the page, but perhaps there is some confusion between CRC and error correction codes. Obviously, technologies were introduced over time. I don't remember ever having a hard disk that didn't have some sort of error detection. The 5 year claim seems decades too short unless they are talking about a newer technology. Even the old 5.25" DOS floppies seemed to be able to detect errors rather than return invalid corrupted bits.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Ottawa, Ontario, Canada
http://mark.mielke.cc/
mark@mark.mielke.cc writes:
> The 5 year claim seems
> decades too short unless they are talking about a newer technology.

I think what Simon is on about is CRCs being routinely used on the cable between the disk drive and the CPU. When I was involved in this stuff you usually only got a parity bit on each byte transferred. CRCs for all disk command/result messages would definitely help close a risk area --- but that's only one risk of many.

            regards, tom lane
On Tue, 2006-10-24 at 10:18 -0400, mark@mark.mielke.cc wrote:
> On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
>> But not that often, with full_page_writes = off. So we could get away
>> with just CRC checking the page-spanning ones and mark the records to
>> show whether they have been CRC checked or not and need to be rechecked
>> at recovery time. That would reduce the CRC overhead to about 1-5% of
>> what it is now (as an option).
>
> WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
> guarantee in here that the 8 Kbytes worth of data will be written as
> sequential writes, nor that the 8 Kbytes of data will necessarily
> finish.
>
> If the operating system uses 8 Kbyte pages, or the RAID system uses 8
> Kbytes or larger chunks, and they guarantee sequential writes, perhaps
> it is ok. Still, if the power goes out after writing the first 512
> bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
> might get better or worse, depending on the RAID configuration.

That is the torn-page problem. If your system doesn't already protect you against this you have no business turning off full_page_writes, which was one of my starting assumptions.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Tue, Oct 24, 2006 at 05:05:58PM +0100, Simon Riggs wrote:
> On Tue, 2006-10-24 at 10:18 -0400, mark@mark.mielke.cc wrote:
>> If the operating system uses 8 Kbyte pages, or the RAID system uses 8
>> Kbytes or larger chunks, and they guarantee sequential writes, perhaps
>> it is ok. Still, if the power goes out after writing the first 512
>> bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
>> might get better or worse, depending on the RAID configuration.
>
> That is the torn-page problem. If your system doesn't already protect
> you against this you have no business turning off full_page_writes,
> which was one of my starting assumptions.

I wasn't aware that a system could protect against this. :-)

I write 8 Kbytes - how can I guarantee that the underlying disk writes all 8 Kbytes before it loses power? And why isn't the CRC a valid means of dealing with this? :-)

If I'm wrong on one of these assumptions, I'm open to being educated. My opinion as of a few seconds ago, is that a write to a single disk sector is safe, but that a write that extends across several sectors is not.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Ottawa, Ontario, Canada
http://mark.mielke.cc/
On Tue, 2006-10-24 at 12:47 -0400, mark@mark.mielke.cc wrote:
> On Tue, Oct 24, 2006 at 05:05:58PM +0100, Simon Riggs wrote:
>> That is the torn-page problem. If your system doesn't already protect
>> you against this you have no business turning off full_page_writes,
>> which was one of my starting assumptions.
>
> I wasn't aware that a system could protect against this. :-)
>
> I write 8 Kbytes - how can I guarantee that the underlying disk writes
> all 8 Kbytes before it loses power? And why isn't the CRC a valid means
> of dealing with this? :-)
>
> If I'm wrong on one of these assumptions, I'm open to being educated.
> My opinion as of a few seconds ago, is that a write to a single disk
> sector is safe, but that a write that extends across several sectors
> is not.

http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html

full_page_writes = off

I'm very happy to learn more myself...

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Tue, 2006-10-24 at 14:52 +0100, Simon Riggs wrote:
> On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
>> No, because unlike tuples, WAL records can and do cross page boundaries.
>
> But not that often, with full_page_writes = off. So we could get away
> with just CRC checking the page-spanning ones and mark the records to
> show whether they have been CRC checked or not and need to be rechecked
> at recovery time. That would reduce the CRC overhead to about 1-5% of
> what it is now (as an option).

Looking further, I see that the xlog page header already contains xlp_pageaddr, which is an XLogRecPtr. So an xlrec that tried to span multiple pages yet failed in between would easily show up as a failure in ValidXLOGHeader(), even before the CRC check. [The XLogRecPtr contains both the offset within the file and a unique identification of the WAL file, so any data left over from previous uses of that data file will be easily recognised as such.] So we don't even need to CRC check the page-spanning ones either.

So it all comes down to: do you trust your hardware? I accept that some hardware/OS combinations will give you high risk, others will reduce it considerably. All I'm looking to do is to pass on the savings for people that are confident they have invested wisely in hardware.

Does anybody have a reasonable objection to introducing an option for people that are comfortable they are making the correct choice?

wal_checksum = off (or other suggested naming...)

I don't want to take blind risks, so shoot me down, please, if I err. I'm happy either way: either we speed up, or we're safer not to.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
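Schematically, the header check Simon refers to looks something like this (a sketch with simplified names; xlp_pageaddr is a real field of the xlog page header, but the function itself is illustrative, not the actual ValidXLOGHeader() code):

    /* Sketch only; assumes PostgreSQL's xlog headers for the types. */
    static bool
    xlog_page_looks_valid(XLogPageHeader hdr, XLogRecPtr expected_addr)
    {
        /*
         * A page left over from an earlier use of a recycled segment, or a
         * page whose write never completed, carries the wrong address and
         * fails this test before any CRC is computed.
         */
        return XLByteEQ(hdr->xlp_pageaddr, expected_addr);
    }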
mark@mark.mielke.cc wrote:
> If I'm wrong on one of these assumptions, I'm open to being educated.
> My opinion as of a few seconds ago, is that a write to a single disk
> sector is safe, but that a write that extends across several sectors
> is not.

Unless it's fsync'ed, which is what we do at CHECKPOINT. Keep in mind that we save full page images on WAL the first time we touch the page after a checkpoint. This means that if a partial write occurred, we will restore it from WAL. So it's not safe in general, but it is safe in Postgres.

--
Alvaro Herrera   http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 10/24/06, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:
> I wasn't aware that a system could protect against this. :-)
>
> I write 8 Kbytes - how can I guarantee that the underlying disk writes
> all 8 Kbytes before it loses power? And why isn't the CRC a valid means
> of dealing with this? :-)
[snip]

A file system with an appropriate transaction method could do this. In *theory* reiser4 write()s are atomic. No one has verified, however, that there is no torn-page risk introduced in some other part of the kernel.

I'm not aware of any other system which can guarantee the atomicity of 8k writes.
"Gregory Maxwell" <gmaxwell@gmail.com> writes: > I'm not aware of any other system which can guaranteed the atomicity > of 8k writes. The reasoning for supporting full_page_writes = off is that if you have a stable kernel and suitable backup power, battery backed write cache, etc, your risk of a partially completed write() may be low enough to be acceptable. Obviously there are no 100.00% guarantees, but that's what you keep backups for ... Simon is essentially arguing that if we are willing to assume no incomplete write() we may as well assume it for WAL too. This seems to me to be raising the risk significantly, but I admit that I can't put my finger on why exactly. One point not directly related to crash safety is whether CRC checking isn't still going to be a good idea when PITR is enabled. Archived WAL files are going to have been through a great deal more transferring than ones merely being used to recover from a crash. regards, tom lane
Sorry for getting into the conversation so late... It was a long weekend in India.
On 10/23/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I didn't particularly trust the timing calculations in your benchmark
program,
Any particular reason? (why and what did you doubt in it?).
I designed the prog. to be flexible enough to test different-sized blocks (to measure the cost of single/fewer INIT/COMP/FIN iterations) and different-sized lists of data (to control the number of iterations). Please share your wisdom.
When I first saw your results, I had a strong feeling that function-call overhead was working against SB8. Jeremy's trials with disabling loop optimizations, and his subsequent success, also pointed to this possibility.
So, I have taken your tests and converted the SB8 function calls into macros. And the results are (please note that crc = 0 is explained later):
std_8192_noprintcrc.out
crc = 0, bufsize = 8192, loops = 1000000, elapsed = 8.471994
sb8_8192_noprintcrc.out
crc = 0, bufsize = 8192, loops = 1000000, elapsed = 0.000006
std_8192_printcrc.out
crc = 8228BB0E, bufsize = 8192, loops = 1000000, elapsed = 32.490704
sb8_8192_printcrc.out
crc = 7E67A22A, bufsize = 8192, loops = 1000000, elapsed = 22.349156
std_64_noprintcrc.out
crc = 0, bufsize = 64, loops = 1000000, elapsed = 0.151354
sb8_64_noprintcrc.out
crc = 0, bufsize = 64, loops = 1000000, elapsed = 0.000005
std_64_printcrc.out
crc = 9C9FBE2E, bufsize = 64, loops = 1000000, elapsed = 0.559315
sb8_64_printcrc.out
crc = F70BC6AE, bufsize = 64, loops = 1000000, elapsed = 0.357382
The result names are in the format: <algo_type>_<test_size>_<was_mycrc_referenced_in_printf>.out
crc = 0 in the result means that the mycrc variable was not referenced anywhere after the for-loop. As can be seen, if mycrc is not referenced in the printf (that is, its usage is limited to the inside of the 'for' loop), then GCC (4.1) seems to optimize the loop heavily. In the case of SB8, if mycrc is not referenced later, GCC seems to have removed the loop entirely!
The only difference between the <x>_noprintcrc and the <x>_printcrc tests was that in the printf() call, the first parameter after the format string was either a zero or the mycrc variable, respectively.
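[To illustrate the effect: a minimal sketch, where crc_update is a toy stand-in rather than the code from the attachment, of how a result that is never read lets the compiler delete the whole timing loop.]

#include <stdio.h>
#include <stdint.h>

/* Toy stand-in for one CRC pass; not a real CRC, just enough work
 * for the optimizer to reason about. */
static uint32_t
crc_update(uint32_t crc, const unsigned char *buf, int len)
{
    for (int i = 0; i < len; i++)
        crc = (crc << 1) ^ buf[i];
    return crc;
}

void
benchmark(const unsigned char *buf, int len)
{
    uint32_t mycrc = 0;

    /* If mycrc is never read after this loop, the compiler can prove
     * the loop has no observable effect and delete it wholesale,
     * which is how timings like "elapsed = 0.000006" come about. */
    for (int i = 0; i < 1000000; i++)
        mycrc = crc_update(mycrc, buf, len);

    /* Referencing mycrc here forces the work to actually happen ... */
    printf("crc = %X, loops = %d\n", mycrc, 1000000);

    /* ... whereas printing a literal zero leaves the loop dead code:
     *     printf("crc = %X, loops = %d\n", 0, 1000000);
     */
}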
I am somewhat apprehensive that I might have made a mistake while converting the function calls to macros, though I have not been able to find one thus far. Please check its validity against the function-call version.
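[For reference, a sketch of the kind of function-to-macro conversion in question, using a simplified single-table CRC step rather than the real SB8 INIT/COMP/FIN routines.]

#include <stdint.h>

/* Single-byte CRC step as a function (simplified; the real SB8 code
 * consumes 8 bytes per step from multiple tables) ... */
static inline uint32_t
crc_byte_func(uint32_t crc, unsigned char b, const uint32_t *table)
{
    return (crc >> 8) ^ table[(crc ^ b) & 0xFF];
}

/* ... and the same step as a macro that updates crc in place.  All
 * arguments are fully parenthesized, but crc appears twice, so callers
 * must pass a plain variable there, never an expression with side
 * effects.  Done correctly, the two forms compute identical results;
 * the macro only removes call overhead the compiler failed to inline
 * away. */
#define CRC_BYTE_MACRO(crc, b, table) \
    ((crc) = ((crc) >> 8) ^ (table)[((crc) ^ (b)) & 0xFF])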
If there's no mistake, then I think SB8 is back in the performance game. These results were obtained with gcc 4.1 on FC5, running on an Intel Pentium M 1.86 GHz, with the OS started and running in runlevel 3.
Please dump the .c and .h files from the attachment on top of Tom's package, and test as before.
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Attachment
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Simon is essentially arguing that if we are willing to assume no
> incomplete write() we may as well assume it for WAL too. This seems
> to me to be raising the risk significantly, but I admit that I can't
> put my finger on why exactly.

Actually I think we can deal with torn pages in the WAL more easily than
in database files anyway. In database files we need to get the entire
page correct one way or the other, so we need full_page_writes in order
to deal with them properly.

In the WAL we just need to be able to detect torn pages and stop reading
WAL at that point. That's easier and doesn't really need a CRC. We could
just adopt the Sybase strategy of storing a unique id number every 512
bytes throughout the WAL page. If those numbers don't match then we have
a torn page; the system crashed at that point and we should stop reading
WAL pages.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
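[A minimal sketch of that scheme, assuming 8 KB WAL pages and 512-byte hardware sectors; the names are illustrative, not from the PostgreSQL source.]

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define WAL_PAGE_SIZE    8192
#define SECTOR_SIZE      512
#define SECTORS_PER_PAGE (WAL_PAGE_SIZE / SECTOR_SIZE)   /* 16 */

/* Before writing, stamp the same id into a reserved slot at the start
 * of each 512-byte sector of the page; a fresh id is used per write. */
static void
stamp_page(unsigned char *page, uint32_t write_id)
{
    for (int s = 0; s < SECTORS_PER_PAGE; s++)
        memcpy(page + s * SECTOR_SIZE, &write_id, sizeof(write_id));
}

/* On recovery, a torn page shows up as disagreeing stamps: some
 * sectors carry the new id, some the id of an older write. */
static bool
page_is_torn(const unsigned char *page)
{
    uint32_t first, cur;

    memcpy(&first, page, sizeof(first));
    for (int s = 1; s < SECTORS_PER_PAGE; s++)
    {
        memcpy(&cur, page + s * SECTOR_SIZE, sizeof(cur));
        if (cur != first)
            return true;        /* stop reading WAL here */
    }
    return false;
}

[In a real implementation the record payload would have to be laid out around the reserved stamp slots, which is part of the cost of the scheme.]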
On Tue, 2006-10-24 at 14:07 -0400, Tom Lane wrote:
> "Gregory Maxwell" <gmaxwell@gmail.com> writes:
> > I'm not aware of any other system which can guarantee the atomicity
> > of 8k writes.
>
> The reasoning for supporting full_page_writes = off is that if you have
> a stable kernel and suitable backup power, battery-backed write cache,
> etc., your risk of a partially completed write() may be low enough to
> be acceptable. Obviously there are no 100.00% guarantees, but that's
> what you keep backups for ...
>
> Simon is essentially arguing that if we are willing to assume no
> incomplete write() we may as well assume it for WAL too. This seems
> to me to be raising the risk significantly, but I admit that I can't
> put my finger on why exactly.

I agree about the significant additional risk, hence the additional
parameter. I'll do some internal testing to see what the risk-reward is.
If that seems worthwhile, then I'll post the patch for general
testing/comment.

(Incidentally, having GUCs that depend on other GUCs is bad news, since
they are set alphabetically. I'd want to allow wal_checksum = off only
when full_page_writes = off, which will work, but only because W comes
after F and for no other reason. A generic solution for dependent GUCs
would be great...)

> One point not directly related to crash safety is whether CRC checking
> isn't still going to be a good idea when PITR is enabled. Archived WAL
> files are going to have been through a great deal more transferring than
> ones merely being used to recover from a crash.

Agreed. Both disks and tapes/other mechanisms must be known CRC-safe
before this idea would be worth using in production. Many enterprises do
already think they have bomb-proof kit, so we may as well support them
in that belief.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > On 10/23/06, Tom Lane <tgl@sss.pgh.pa.us > wrote: >> I didn't particularly trust the timing calculations in your benchmark >> program, > Any particular reason? (why and what did you doubt in it?). Well, the specific thing that set off my bogometer was #define TV_DIFF_MILLI(tv1, tv2) ((tv2.tv_sec*1000+((tv2.tv_usec)/1000))-(tv1.tv_sec*1000+((tv1.tv_usec)/1000))) which is going to have overflow problems on any platform where tv_sec isn't a 64-bit type (which is still all of 'em AFAIK). But more generally, your test is timing a CRC across 100 4Kb segments, which isn't representative of PG's usage of CRCs. I don't think there are any XLogWrite calls that have more than about 5 segments, and in most cases those segments contain a few dozen bytes not a few K. So you need to be looking at much shorter loop runs. The test case I proposed uses timing code that I trusted (borrowed from long-established coding in postgres.c), and tests loop lengths that are somewhat representative for PG, but it is still biased in favor of slice8 because it repeats the CRC calculations consecutively without any other activity --- presumably this fact creates a bias for a method that needs more L2 cache space over one that doesn't need so much. I'd have tried harder to make an unbiased test case if this version had showed slice8 as competitive, but so far it seems that on a majority of current CPUs and compilers it's not competitive. regards, tom lane
On Tue, 2006-10-24 at 15:42 -0400, Gregory Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Simon is essentially arguing that if we are willing to assume no
> > incomplete write() we may as well assume it for WAL too. This seems
> > to me to be raising the risk significantly, but I admit that I can't
> > put my finger on why exactly.
>
> Actually I think we can deal with torn pages in the WAL more easily
> than in database files anyway. In database files we need to get the
> entire page correct one way or the other, so we need full_page_writes
> in order to deal with them properly.
>
> In the WAL we just need to be able to detect torn pages and stop
> reading WAL at that point. That's easier and doesn't really need a
> CRC. We could just adopt the Sybase strategy of storing a unique id
> number every 512 bytes throughout the WAL page. If those numbers don't
> match then we have a torn page; the system crashed at that point and
> we should stop reading WAL pages.

Right, but I'm only suggesting using this if you're safe from torn pages
anyhow. I'm not suggesting anyone set wal_checksum = off when
full_page_writes = on.

I do think you're right: we're much less exposed by a torn page in WAL
than we are for the stable database.

I've looked into this in more depth following your suggestion: I think
it seems straightforward to move the xl_prev field from being a header
to a trailer. That way, when we do the test on the back-pointer, we will
be assured that there is no torn page affecting the remainder of the
xlrec. That would make it safer with wal_checksum = off.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > I've looked into this in more depth following your suggestion: I think > it seems straightforward to move the xl_prev field from being a header > to a trailer. That way when we do the test on the back pointer we will > be assured that there is no torn page effecting the remainder of the > xlrec. That would make it safer with wal_checksum = off. Hm. I think in practice this may actually help reduce the exposure to torn pages. However in theory there's no particular reason to think the blocks will be written out in physical order. The kernel may sync its buffers in some order dictated by its in-memory data structure and may end up coming across the second half of the 8kb page before the first half. It may even lie earlier on disk than the first half if the filesystem started a new extent at that point. If they were 4kb pages there would be fewer ways it could be written out of order, but even then the hard drive could find a bad block and remap it. I'm not sure what level of granularity drives remap at, it may be less than 4kb. To eliminate the need for the CRC in the WAL for everyone and still be safe from torn pages I think you have to have something like xl_prev repeated every 512b throughout the page. But if this is only an option for systems that don't expect to suffer from torn pages then sure, putting it in a footer seems like a good way to reduce the exposure somewhat. Putting it in both a header *and* a footer might be even better. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Mon, Oct 23, 2006 at 05:23:27PM -0400, Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
> > Right - I think the regression is caused by libc and kernel being built
> > with gcc 3.4.6 and the test program being built with gcc 4.1.2.
>
> Why do you think that? The performance of the CRC loop shouldn't depend
> at all on either libc or the kernel, because they're not invoked inside
> the loop.

I can believe that not re-building GCC 4.1.x with the 4.1.x compiler
could result in it not taking full advantage of new features and
functions.

Ken
>> In the WAL we just need to be able to detect torn pages and stop
>> reading WAL at that point. That's easier and doesn't really need a
>> CRC. We could just adopt the Sybase strategy of storing a unique id
>> number every 512 bytes throughout the WAL page. If those numbers don't
>> match then we have a torn page; the system crashed at that point and
>> we should stop reading WAL pages.

> I've looked into this in more depth following your
> suggestion: I think it seems straightforward to move the
> xl_prev field from being a header to a trailer. That way when
> we do the test on the back-pointer we will be assured that
> there is no torn page affecting the remainder of the xlrec.
> That would make it safer with wal_checksum = off.

I do not think we can assume any order in which a block is written to
disk.

I think all this can only be used on OS and hardware combinations that
can guarantee that what is written by one IO call (typically 8k) from pg
is safe. Those combinations do exist, so I think we want the switch.

Putting xl_prev at the end helps only for direct IO WAL sync methods;
otherwise we would need it on every page.

Andreas
On Fri, 2006-10-27 at 10:54 +0200, Zeugswetter Andreas ADI SD wrote:
> >> In the WAL we just need to be able to detect torn pages and stop
> >> reading WAL at that point. That's easier and doesn't really need a
> >> CRC. We could just adopt the Sybase strategy of storing a unique id
> >> number every 512 bytes throughout the WAL page. If those numbers
> >> don't match then we have a torn page; the system crashed at that
> >> point and we should stop reading WAL pages.
>
> > I've looked into this in more depth following your
> > suggestion: I think it seems straightforward to move the
> > xl_prev field from being a header to a trailer. That way when
> > we do the test on the back-pointer we will be assured that
> > there is no torn page affecting the remainder of the xlrec.
> > That would make it safer with wal_checksum = off.
>
> I do not think we can assume any order in which a block is written to
> disk.
>
> I think all this can only be used on OS and hardware combinations that
> can guarantee that what is written by one IO call (typically 8k) from
> pg is safe. Those combinations do exist, so I think we want the switch.

OK, good. ... but from the various comments I'll not bother moving
xl_prev to be a trailer; I'll just leave it where it is now.

> Putting xl_prev at the end helps only for direct IO WAL sync methods;
> otherwise we would need it on every page.

[There is already an XLogRecPtr on each 8k page.]

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Fri, Oct 27, 2006 at 10:11:08AM +0100, Simon Riggs wrote:
> > Putting xl_prev at the end helps only for direct IO WAL sync methods;
> > otherwise we would need it on every page.
>
> [There is already an XLogRecPtr on each 8k page.]

Given that the hardware sector size is still 512 bytes, should there be
a way of detecting a missing 512-byte block in the middle of an 8K
block? The idea of simply writing a serial counter every 512 bytes seems
to be a good way to handle that...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
> > > Putting xl_prev at the end helps only for direct IO WAL sync
> > > methods; otherwise we would need it on every page.
> >
> > [There is already an XLogRecPtr on each 8k page.]
>
> Given that the hardware sector size is still 512 bytes, should there
> be a way of detecting a missing 512-byte block in the middle of an 8K
> block? The idea of simply writing a serial counter every 512 bytes
> seems to be a good way to handle that...

No, we have CRC for that. You are not supposed to turn it off when there
is a chance that an 8k block might not be whole.

Andreas