Thread: Re: New CRC algorithm: Slicing by 8
Resending... The previous three attempts haven't succeeded, probably because of size restrictions, so I am sending it in two parts.
--
gurjeet@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
On 10/18/06, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:
Hi all,
Please refer to the following link for a new algorithm to calculate CRC, developed by Intel.
http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
Please find attached the patch (pg_crc_sb8_4.diff.gz) that implements this algorithm (originally obtained from http://sourceforge.net/projects/slicing-by-8).
I tested it using pgbench, and found an average improvement of about 15 tps (291 vs. 307 tps) when running with a scaling factor of 10, 10 clients, and 1000 transactions per client (I don't remember these pgbench params exactly; it's been quite a while now).
The author of the algorithm says it works (theoretically) three times faster than conventional CRC calculation methods, but the tps improvement didn't reflect that; so, to demonstrate the raw speed, I extracted the source files (both PG and SB8) and made a separate project (test.tar.gz).
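For illustration, here is a sketch of the difference between the two inner loops (not the actual PG or Intel sources; the lookup tables are assumed to be precomputed elsewhere, and the index conventions are those of common reflected CRC-32 implementations):

    #include <stdint.h>
    #include <stddef.h>

    extern const uint32_t crc_table[256];     /* conventional: one 1 KB table */
    extern const uint32_t sb8_table[8][256];  /* slicing-by-8: eight tables, 8 KB */

    /* Conventional algorithm: one dependent table lookup per input byte. */
    uint32_t
    crc32_bytewise(uint32_t crc, const unsigned char *p, size_t len)
    {
        while (len-- > 0)
            crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
        return crc;
    }

    /*
     * Slicing-by-8: consume 8 bytes per iteration; the eight lookups are
     * mutually independent, so the CPU can overlap them instead of waiting
     * for each one before starting the next.
     */
    uint32_t
    crc32_sb8(uint32_t crc, const unsigned char *p, size_t len)
    {
        while (len >= 8)
        {
            uint32_t lo = crc ^ (p[0] | p[1] << 8 | p[2] << 16 | (uint32_t) p[3] << 24);
            uint32_t hi = p[4] | p[5] << 8 | p[6] << 16 | (uint32_t) p[7] << 24;

            crc = sb8_table[7][lo & 0xFF] ^ sb8_table[6][(lo >> 8) & 0xFF] ^
                  sb8_table[5][(lo >> 16) & 0xFF] ^ sb8_table[4][lo >> 24] ^
                  sb8_table[3][hi & 0xFF] ^ sb8_table[2][(hi >> 8) & 0xFF] ^
                  sb8_table[1][(hi >> 16) & 0xFF] ^ sb8_table[0][hi >> 24];
            p += 8;
            len -= 8;
        }
        while (len-- > 0)    /* tail bytes fall back to the byte-wise loop */
            crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
        return crc;
    }

The theoretical win comes from that instruction-level parallelism; whether it materializes depends heavily on the CPU and compiler, which is what the tests below explore.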
The resulting binary's command format is:
<snip>
<prog_name> <algo_name> [ <scaling_factor> [ <block_size> [ <list_size> ] ] ]
algo_name : PG or SB8 or BOTH (just the first char will also do!)
scaling_factor: the multiple to use when creating the list of buffer blocks.
block_size : buffer size (in KBs); default is 4.
list_size : number of data blocks to create for the test; default is 100.
So the command: a.out p 3 1024 300
will test PG's algorithm over 900 blocks of 1 MB each.
</snip>
The test.sh script is a wrapper around this binary; it tests PG and SB8 individually, and both together, three times each. I performed two tests using this script to demonstrate SB8's effectiveness on large and small data blocks:
sh test.sh 1 1024 800 -- 800 blocks of 1 MB each; effective data size 800 MB
sh test.sh 800 1 1024 -- 800 * 1024 blocks of 1 KB each; effective data size 800 MB
The new algorithm shone in both test configurations, performing 3 to 4 times better than the regular CRC algorithm in both types of tests. The results are in the files results_1_1024_800.out and results_800_1_1024.out, respectively.
To test it yourself, extract all the files from test.tar.gz, and issue the following commands (assuming bash):
sh make.sh
sh test.sh 1 1024 800
sh test.sh 800 1 1024
(Note: you may want to reduce the effective data size on a Windows box; I used 500 MB on Windows+MinGW. I chose these parameters because I have only 1 GB of RAM, and increasing the data size beyond these numbers caused the memory contents to spill into swap-space/pagefile, and that disk I/O severely distorts the results.)
The tests were performed on Fedora Core 5 and MinGW on WinXP Professional.
If possible, people should test it on different platforms, so as to ensure that it doesn't perform any worse than the older implementation on any supported platform (please post the results if you do test it). Also, recommendations for a better benchmark than pgbench are welcome, so that we can show the improvement when it is used as part of PG.
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Resending... The previous three attempts haven't succeeded, probably because of size restrictions, so I am sending it in two parts.
This is part 2; unfortunately, the size of the gzipped file is still a bit big (26 KB), so I have uploaded it to
http://www.geocities.com/gurjeet79/crc_sb8_test.tar.gz
Please download and test it from there.
Can someone let us all know what is the limit on the attachments sent to the -hackers list?
Best regards,
--
gurjeet@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Gurjeet Singh wrote:
> Can someone let us all know what is the limit on the attachments
> sent to the -hackers list?

Generally it's probably best not to send attachments to -hackers at all. If you have a patch, send it to -hackers. Or you can publish it somewhere on the web and post a link to it (this is one of the things I think a developers' wiki is actually useful for).

cheers

andrew
Andrew Dunstan wrote:
> If you have a patch, send it to -hackers.

I mean -patches of course. Sorry.

cheers

andrew
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes: >> Please refer to the following link for a new algorithm to calculate >> CRC, developed by Intel. >> >> http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm I can't help wondering what the patent status of this algorithm is. (This from a guy who still remembers very well how Unisys published the LZW algorithm in a manner that made it look like it was free for anyone to use...) regards, tom lane
Tom Lane wrote:
> "Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
>>> Please refer to the following link for a new algorithm to calculate
>>> CRC, developed by Intel.
>>>
>>> http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
>
> I can't help wondering what the patent status of this algorithm is.
>
> (This from a guy who still remembers very well how Unisys published the
> LZW algorithm in a manner that made it look like it was free for anyone
> to use...)

This article

http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm

seems to imply it's provided "free". It doesn't say anything about being free of patents though.

The Sourceforge project referred to in the article (but for which no link is given) seems to be this one:

http://sourceforge.net/projects/slicing-by-8

The "files" section contains a zip archive, inside which there are three source files all of which state that the package, which is (c) Intel Corp. 2004-2006, is released under the BSD license, with this URL:

http://www.opensource.org/licenses/bsd-license.html

Again, no mention of patents anywhere.

--
Alvaro Herrera   http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 10/20/06, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> It doesn't say anything about
> being free of patents though.
>
> The Sourceforge project referred to in the article (but for which no
> link is given) seems to be this one:
>
> http://sourceforge.net/projects/slicing-by-8

Yes. I had already mentioned that link in my posts.

> The "files" section contains a zip archive, inside which there are three
> source files all of which state that the package, which is (c) Intel
> Corp. 2004-2006, is released under the BSD license, with this URL:
>
> http://www.opensource.org/licenses/bsd-license.html
>
> Again, no mention of patents anywhere.

I'll try to get in touch with the author and get the clarification.

Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Gurjeet Singh wrote:
> On 10/20/06, Alvaro Herrera <alvherre@commandprompt.com> wrote:
>> It doesn't say anything about
>> being free of patents though.
>>
>> The Sourceforge project referred to in the article (but for which no
>> link is given) seems to be this one:
>>
>> http://sourceforge.net/projects/slicing-by-8
>
> Yes. I had already mentioned that link in my posts.
>
>> The "files" section contains a zip archive, inside which there are three
>> source files all of which state that the package, which is (c) Intel
>> Corp. 2004-2006, is released under the BSD license, with this URL:
>>
>> http://www.opensource.org/licenses/bsd-license.html
>>
>> Again, no mention of patents anywhere.
>
> I'll try to get in touch with the author and get the clarification.

Yep, the license and patent status seem to be separate issues. However, I am not sure getting a clarification from the author even helps us legally. Also, why are we more critical of an Intel-provided idea than any other idea we get from the community? The scary answer is that in many ways they are the same, and have the same risks. So unless we hear about a problem, I think we should use the code.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> However, I am not sure getting a clarification from the author even
> helps us legally. Also, why are we more critical of an Intel-provided
> idea than any other idea we get from the community?

Bitter experience with other companies.

> So unless we hear about a problem, I think we should use the code.

It hasn't even been tested. One thing I'd want to know about is the performance effect on non-Intel machines.

            regards, tom lane
On 10/21/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[snip]
> It hasn't even been tested. One thing I'd want to know about is the
> performance effect on non-Intel machines.

On Opteron 265 his test code shows SB8 (the Intel algorithm) is 2.48x faster for checksum and 1.95x faster for verify for the 800 * 1024 blocks of 1 KB each workload. For 100000 blocks of 8k I got similar results as well. It looks like the new code may have a larger cache footprint, so actual performance may differ from the microbenchmark.
On 10/22/06, Gregory Maxwell <gmaxwell@gmail.com> wrote:
> On Opteron 265 his test code shows SB8 (the Intel algorithm) is 2.48x
> faster for checksum and 1.95x faster for verify for the 800 * 1024
> blocks of 1 KB each workload. For 100000 blocks of 8k I got similar
> results as well.

I think you meant 800 * 1024 blocks of 1 >MB< each workload.

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes: >> If possible, people should test it on different platforms, so as to >> ensure that it doesn't perform any worse than older implementation on any >> supported platform (please post the results, if you do test it). I didn't particularly trust the timing calculations in your benchmark program, so I made up my own low-tech test version (attached). I get the following timings for 1 million iterations of INIT_CRC32/COMP_CRC32/FIN_CRC32 on different buffer sizes, using "gcc -O2" on some machines laying about the house: Std CRC Slice8 CRC HPPA (HPUX 10.20) 8192 bytes 44.897212 35.191499 1024 bytes 5.639081 4.669850 64 bytes 0.377416 0.613195 PPC (Mac G4, Darwin 10.4.8) 8192 bytes 12.663135 12.158293 1024 bytes 1.579940 1.614967 64 bytes 0.104310 0.229401 Intel Xeon EM64T (Fedora Core 5) 8192 bytes 4.420879 7.633120 1024 bytes 0.571794 0.819372 64 bytes 0.047354 0.071906 Intel Pentium 4 (Fedora Core 5) 8192 bytes 6.942324 28.848572 (yes, really) 1024 bytes 0.905259 3.625360 64 bytes 0.068065 0.260224 It's worth pointing out that this test program is biased in favor of the slice8 code about as far as it could possibly be: after the first iteration, the remaining 999999 will find the cache as hot as possible. Furthermore, the test doesn't exercise misaligned or odd-length cases, which would incur extra start/stop overhead for slice8. These numbers are um, not impressive. Considering that a large fraction of our WAL records are pretty short, the fact that slice8 consistently loses at short buffer lengths is especially discouraging. Much of that ground could be made up perhaps with tenser coding of the initialization and finalization code, but it'd still not be worth taking any legal risk for AFAICS. regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> However, I am not sure getting a clarification from the author even
>> helps us legally. Also, why are we more critical of an Intel-provided
>> idea than any other idea we get from the community?
>
> Bitter experience with other companies.

The problem is we have lots of companies involved, and I bet some we don't even know about (e.g. yahoo/gmail addresses), and with contributors who don't know that their employment agreement says anything they do while employed is company property. And we have Unisys involved now too. How worrying is that? :-(

>> So unless we hear about a problem, I think we should use the code.
>
> It hasn't even been tested. One thing I'd want to know about is the
> performance effect on non-Intel machines.

Sure.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> Also, why are we more critical of an Intel-provided
>>> idea than any other idea we get from the community?
>>
>> Bitter experience with other companies.
>
> The problem is we have lots of companies involved, and I bet some we
> don't even know about (e.g. yahoo/gmail addresses),

It's not so much that I don't trust Intel as that a CRC algorithm is exactly the sort of nice little self-contained thing that people love to try to patent these days. What I am really afraid of is that someone else has already invented this same method (or something close enough to it) and filed for a patent that Intel doesn't know about either. I'd be wondering about that no matter where the code had come from.

Given the numbers I posted earlier today, the proposal is dead in the water anyway, quite aside from any legal considerations.

            regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> Tom Lane wrote:
>>> Bruce Momjian <bruce@momjian.us> writes:
>>>> Also, why are we more critical of an Intel-provided
>>>> idea than any other idea we get from the community?
>>>
>>> Bitter experience with other companies.
>>
>> The problem is we have lots of companies involved, and I bet some we
>> don't even know about (e.g. yahoo/gmail addresses),
>
> It's not so much that I don't trust Intel as that a CRC algorithm is
> exactly the sort of nice little self-contained thing that people love
> to try to patent these days. What I am really afraid of is that someone
> else has already invented this same method (or something close enough
> to it) and filed for a patent that Intel doesn't know about either.
> I'd be wondering about that no matter where the code had come from.
>
> Given the numbers I posted earlier today, the proposal is dead in the
> water anyway, quite aside from any legal considerations.

Agreed. I just wanted to point out we have other sharks in the water. :-(

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On Sun, Oct 22, 2006 at 06:06:13PM -0400, Tom Lane wrote:
> Intel Xeon EM64T (Fedora Core 5)
> 8192 bytes      4.420879     7.633120
> 1024 bytes      0.571794     0.819372
> 64 bytes        0.047354     0.071906
>
> Intel Pentium 4 (Fedora Core 5)
> 8192 bytes      6.942324    28.848572  (yes, really)
> 1024 bytes      0.905259     3.625360
> 64 bytes        0.068065     0.260224

AMD X2 3800+ (Fedora Core 5)

                Std CRC      Slice8 CRC
8192 bytes      8.576635     7.170038
1024 bytes      1.504361     1.402446
64 bytes        0.154459     0.144209

Odd that the AMD shows the opposite of the two Intel numbers above, and that it was an "Intel engineer" who wrote it. My first speculation is that you did your Intel numbers backwards. My second speculation is that you already thought of that and confirmed before posting. :-)

So yeah - not too impressive...

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Ottawa, Ontario, Canada
http://mark.mielke.cc/
mark@mark.mielke.cc writes:
> ... My first speculation is
> that you did your Intel numbers backwards. My second speculation is
> that you already thought of that and confirmed before posting. :-)

Yah, I checked. Several times... but if anyone else wants to repeat the experiment, please do. Or look for bugs in either my test case or Gurjeet's.

            regards, tom lane
Tom Lane wrote:
> Yah, I checked. Several times... but if anyone else wants to repeat
> the experiment, please do. Or look for bugs in either my test case
> or Gurjeet's.

FWIW - FreeBSD and Linux results using Tom's test program on almost identical hardware[1]:

                Std crc      Slice-8 crc

Intel P-III 1.26Ghz (FreeBSD 6.2)
8192 bytes     12.975314    14.503810
1024 bytes      1.633557     1.852322
64 bytes        0.111580     0.206975

Intel P-III 1.26Ghz (Gentoo 2006.1)
8192 bytes     12.967997    28.363876
1024 bytes      1.632317     3.626230
64 bytes        0.111513     0.326557

Interesting that the slice-8 algorithm seems to work noticeably better on FreeBSD than Linux - but still not as well as the standard one (for these tests anyway)...

Cheers

Mark

[1] Both boxes have identical mobos, memory and CPUs (same sspec nos).
Mark Kirkwood <markir@paradise.net.nz> writes:
> Interesting that the slice-8 algorithm seems to work noticeably better
> on FreeBSD than Linux -

Are you running similar gcc versions on both? I realize I forgot to document what I was using:

HPPA: gcc version 2.95.3 20010315 (release)
PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)

Interesting that it's only the oldest-line tech that's showing a win for me ...

            regards, tom lane
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>> Interesting that the slice-8 algorithm seems to work noticeably better
>> on FreeBSD than Linux -
>
> Are you running similar gcc versions on both? I realize I forgot to
> document what I was using:
>
> HPPA: gcc version 2.95.3 20010315 (release)
> PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
> both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
>
> Interesting that it's only the oldest-line tech that's showing a win
> for me ...

Ah - good point, FreeBSD is using an older compiler:

FreeBSD: gcc (GCC) 3.4.6 [FreeBSD] 20060305
Linux:   gcc (GCC) 4.1.1 (Gentoo 4.1.1)

Hmm - there is a FreeBSD port for 4.1.2, I might set that off to build itself and see if compiling with it changes the results any....
On Mon, 23 Oct 2006 07:35:08 +0200, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I realize I forgot to document what I was using:
>
> HPPA: gcc version 2.95.3 20010315 (release)
> PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
> both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
>
> Interesting that it's only the oldest-line tech that's showing a win
> for me ...

Does anyone have an Intel compiler available? It would be interesting to see results on that compared to gcc (given that it is an Intel algorithm ;).

\Anders
Tom Lane wrote:
> Are you running similar gcc versions on both? I realize I forgot to
> document what I was using:
>
> HPPA: gcc version 2.95.3 20010315 (release)
> PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
> both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
>
> Interesting that it's only the oldest-line tech that's showing a win
> for me ...

It would be interesting to see if the Intel compiler works better for this algorithm. Anyone have it installed?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, 23 Oct 2006, Mark Kirkwood wrote:
> Tom Lane wrote:
>> Yah, I checked. Several times... but if anyone else wants to repeat
>> the experiment, please do. Or look for bugs in either my test case
>> or Gurjeet's.

Just for fun, I tried it out with both GCC and with Intel's C compiler with some aggressive platform-specific flags on my 2.8Ghz Xeon running Gentoo.

                Std crc      Slice-8 crc

Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)
8192 bytes      4.697572     9.806341
1024 bytes      0.597429     1.181828
64 bytes        0.046636     0.086984

Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)
8192 bytes      0.000004     0.001085
1024 bytes      0.000004     0.001292
64 bytes        0.000003     0.001078

So at this point I realize that Intel's compiler is optimizing the loop away, at least for the std crc and probably for both. So I make mycrc an array of 2, and subscript mycrc[j&1] in the loop.

                Std crc      Slice-8 crc

Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)
8192 bytes     51.397146     9.523182
1024 bytes      6.430986     1.229043
64 bytes        0.400062     0.128579

Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)
8192 bytes     29.881708     0.001432
1024 bytes      3.750313     0.001432
64 bytes        0.238583     0.001431

So it looks like something fishy is still going on with the slice-8 with the Intel compiler. I have attached my changed testcrc.c file.

> FWIW - FreeBSD and Linux results using Tom's test program on almost
> identical hardware[1]:
>
>                 Std crc      Slice-8 crc
>
> Intel P-III 1.26Ghz (FreeBSD 6.2)
> 8192 bytes     12.975314    14.503810
> 1024 bytes      1.633557     1.852322
> 64 bytes        0.111580     0.206975
>
> Intel P-III 1.26Ghz (Gentoo 2006.1)
> 8192 bytes     12.967997    28.363876
> 1024 bytes      1.632317     3.626230
> 64 bytes        0.111513     0.326557
>
> Interesting that the slice-8 algorithm seems to work noticeably better on
> FreeBSD than Linux - but still not as well as the standard one (for these
> tests anyway)...
>
> [1] Both boxes have identical mobos, memory and CPUs (same sspec nos).

--
You can tune a piano, but you can't tuna fish.
Mark Kirkwood wrote:
> Tom Lane wrote:
>> Are you running similar gcc versions on both? I realize I forgot to
>> document what I was using:
>
> Ah - good point, FreeBSD is using an older compiler:
>
> FreeBSD: gcc (GCC) 3.4.6 [FreeBSD] 20060305
> Linux:   gcc (GCC) 4.1.1 (Gentoo 4.1.1)
>
> Hmm - there is a FreeBSD port for 4.1.2, I might set that off to build
> itself and see if compiling with it changes the results any....

Here are the results after building gcc 4.1.2 (repeating results for gcc 3.4.6 for comparison). I suspect that performance is probably impacted because gcc 4.1.2 (and also the rest of the tool-chain) is built with gcc 3.4.6 - but it certainly suggests that the newer gcc versions don't like the slice-8 algorithm for some reason.

                Std crc      Slice-8 crc

Intel P-III 1.26Ghz (FreeBSD 6.2 gcc 3.4.6)
8192 bytes     12.975314    14.503810
1024 bytes      1.633557     1.852322
64 bytes        0.111580     0.206975

Intel P-III 1.26Ghz (FreeBSD 6.2 gcc 4.1.2)
8192 bytes     19.516974    29.457739
1024 bytes      2.448570     3.742106
64 bytes        0.112644     0.335292
>>>>> "MK" == Mark Kirkwood <markir@paradise.net.nz> writes:

MK> Here are the results after building gcc 4.1.2 (repeating results
MK> for gcc 3.4.6 for comparison). I suspect that performance is
MK> probably impacted because gcc 4.1.2 (and also the rest of the
MK> tool-chain) is built with gcc 3.4.6 - but it certainly suggests
MK> that the newer gcc versions don't like the slice-8 algorithm for
MK> some reason.

They don't seem to like the old CRC algorithms either. It is quite strange, such dramatic performance regressions from 3.x to 4.x are rare.

/Benny
Jeremy Drake <pgsql@jdrake.com> writes:
> So at this point I realize that Intel's compiler is optimizing the loop
> away, at least for the std crc and probably for both. So I make mycrc an
> array of 2, and subscript mycrc[j&1] in the loop.

That's not a good workaround, because making mycrc expensive to access means your inner loop timing isn't credible at all. Instead try making the buffer array nonlocal --- malloc it, perhaps.

            regards, tom lane
On Sun, 2006-10-22 at 18:06 -0400, Tom Lane wrote:
> These numbers are um, not impressive. Considering that a large fraction
> of our WAL records are pretty short, the fact that slice8 consistently
> loses at short buffer lengths is especially discouraging. Much of that
> ground could be made up perhaps with tenser coding of the initialization
> and finalization code, but it'd still not be worth taking any legal risk
> for AFAICS.

It doesn't look good for SB8, does it? Nor for gcc 4.1 either. Presumably Intel themselves will have some come-back, but I'm not sure what they'll say to so many conclusive tests.

Instead, I'd like to include a parameter to turn off CRC altogether, for heavily CPU bound operations and the WAL drive on trustworthy hardware.

wal_checksum = off

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > Instead, I'd like to include a parameter to turn off CRC altogether, for > heavily CPU bound operations and the WAL drive on trustworthy hardware. No can do --- we rely on the checksums to be able to tell when we've hit the end of WAL during replay. You may as well propose not writing WAL at all (and no, I don't think it'd pass). regards, tom lane
On 10/23/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's not so much that I don't trust Intel as that a CRC algorithm is
> exactly the sort of nice little self-contained thing that people love
> to try to patent these days. What I am really afraid of is that someone
> else has already invented this same method (or something close enough
> to it) and filed for a patent that Intel doesn't know about either.
> I'd be wondering about that no matter where the code had come from.
>
> Given the numbers I posted earlier today, the proposal is dead in the
> water anyway, quite aside from any legal considerations.

The horror, the horror.

I wonder if changing to Slicing by 8 would be self-contained enough that people from the software-patent-free world would be able to just patch their distribution if they wish.

Regards,
Dawid
Bruce, Tom, All:

>> Given the numbers I posted earlier today, the proposal is dead in the
>> water anyway, quite aside from any legal considerations.
>
> Agreed. I just wanted to point out we have other sharks in the water.

*IF* Slice-by-8 turned out to be a winner, I could get the legal issues looked into and probably resolved. But we should only do that if it turns out we really want to use it.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> Instead, I'd like to include a parameter to turn off CRC altogether, for
>> heavily CPU bound operations and the WAL drive on trustworthy hardware.
>
> No can do --- we rely on the checksums to be able to tell when we've hit
> the end of WAL during replay.

No we don't: zero-length records are the trigger for EOF. Sure, a CRC failure will cause replay to end, but only if you calculate it and compare it to what's on disk - which is the bit I would turn off. We don't rely on that, however, so it is avoidable.

In most cases, it would be foolish to avoid: but there are cases where the data is CRC checked by the hardware/system already, so I'd like to make an option to turn this off, defaulting to on, for safety.

> You may as well propose not writing WAL
> at all (and no, I don't think it'd pass).

That would undo all of my efforts, so no I wouldn't consider that.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
> In most cases, it would be foolish to avoid: but there are cases where
> the data is CRC checked by the hardware/system already, so I'd like to
> make an option to turn this off, defaulting to on, for safety.

How would we know? What are those cases? Sounds like a foot gun to me.

Sincerely,

Joshua D. Drake

>> You may as well propose not writing WAL
>> at all (and no, I don't think it'd pass).
>
> That would undo all of my efforts, so no I wouldn't consider that.

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote: >> No can do --- we rely on the checksums to be able to tell when we've hit >> the end of WAL during replay. > No we don't: Zero length records are the trigger for EOF. Only if the file happens to be all-zero already, which is not the normal operating state (see WAL-file recycling). Otherwise you have to be able to detect an invalid record. There are actually three checks used to detect end of WAL: zero record length, invalid checksum, and incorrect back-pointer. Zero length is the first and cleanest-looking test, but AFAICS we have to have both of the others to avoid obvious failure modes. regards, tom lane
On Mon, 23 Oct 2006, Tom Lane wrote:
> Jeremy Drake <pgsql@jdrake.com> writes:
>> So at this point I realize that Intel's compiler is optimizing the loop
>> away, at least for the std crc and probably for both. So I make mycrc an
>> array of 2, and subscript mycrc[j&1] in the loop.
>
> That's not a good workaround, because making mycrc expensive to access
> means your inner loop timing isn't credible at all. Instead try making the
> buffer array nonlocal --- malloc it, perhaps.

That did not make any difference. The way I see it, the only way to convince the compiler it really needs to do this loop more than once is to make it think it is not overwriting the same variable every time. The subscript was the cheapest way I could think of to do that. Any other suggestions on how to do this are welcome.

--
I like being single. I'm always there when I need me.
        -- Art Leo
Jeremy Drake <pgsql@jdrake.com> writes:
> On Mon, 23 Oct 2006, Tom Lane wrote:
>> That's not a good workaround, because making mycrc expensive to access
>> means your inner loop timing isn't credible at all. Instead try making the
>> buffer array nonlocal --- malloc it, perhaps.
>
> That did not make any difference. The way I see it, the only way to
> convince the compiler it really needs to do this loop more than once is to
> make it think it is not overwriting the same variable every time. The
> subscript was the cheapest way I could think of to do that. Any other
> suggestions on how to do this are welcome.

Hmm. Maybe store the CRCs into a global array somewhere?

    uint32 results[NTESTS];

    for ...
    {
        INIT/COMP/FIN_CRC32...
        results[j] = mycrc;
    }

This still adds a bit of overhead to the outer loop, but not much. Another possibility is to put the INIT/COMP/FIN_CRC32 into an external subroutine, thereby adding a call/return to the outer loop overhead.

            regards, tom lane
On Mon, 23 Oct 2006, Tom Lane wrote:
> Hmm. Maybe store the CRCs into a global array somewhere?
>
>     uint32 results[NTESTS];
>
>     for ...
>     {
>         INIT/COMP/FIN_CRC32...
>         results[j] = mycrc;
>     }
>
> This still adds a bit of overhead to the outer loop, but not much.

That seems to have worked.

                Std crc      Slice-8 crc

Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)
8192 bytes     26.765317    10.511143
1024 bytes      3.357843     1.280890
64 bytes        0.223213     0.103767

Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)
8192 bytes     29.495836     0.007107
1024 bytes      3.708665     0.012183
64 bytes        0.242579     0.008700

So the gcc times are reasonable, but the icc times for the slice-by-8 are still too fast to be believed. I will have to take a look at the generated assembly later and see what gives.

My changed testcrc.c is attached, again.

--
"I'd love to go out with you, but I did my own thing and now I've got to undo it."
Benny Amorsen wrote:
>>>>>> "MK" == Mark Kirkwood <markir@paradise.net.nz> writes:
>
> MK> Here are the results after building gcc 4.1.2 (repeating results
> MK> for gcc 3.4.6 for comparison). I suspect that performance is
> MK> probably impacted because gcc 4.1.2 (and also the rest of the
> MK> tool-chain) is built with gcc 3.4.6 - but it certainly suggests
> MK> that the newer gcc versions don't like the slice-8 algorithm for
> MK> some reason.
>
> They don't seem to like the old CRC algorithms either. It is quite
> strange, such dramatic performance regressions from 3.x to 4.x are
> rare.

Right - I think the regression is caused by libc and kernel being built with gcc 3.4.6 and the test program being built with gcc 4.1.2. Rebuilding *everything* with 4.1.2 (which I'm not sure is possible for FreeBSD at the moment) would probably get us back to numbers that looked more like my Gentoo ones [1].

Cheers

Mark

[1] Note that the upgrade process for switching Gentoo from gcc 3.4 to 4.1 involves precisely this - build 4.1, then rebuild everything using 4.1 (including 4.1 itself!)
Mark Kirkwood <markir@paradise.net.nz> writes:
> Right - I think the regression is caused by libc and kernel being built
> with gcc 3.4.6 and the test program being built with gcc 4.1.2.

Why do you think that? The performance of the CRC loop shouldn't depend at all on either libc or the kernel, because they're not invoked inside the loop.

            regards, tom lane
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>> Right - I think the regression is caused by libc and kernel being built
>> with gcc 3.4.6 and the test program being built with gcc 4.1.2.
>
> Why do you think that? The performance of the CRC loop shouldn't depend
> at all on either libc or the kernel, because they're not invoked inside
> the loop.

Just come to the same conclusion myself... I've got gcc 3.4.6 on Gentoo, so tried using that, and observed no regression at all there for the std case - which pretty much rules out any issues connected with "what did I build kernel + libc with vs what compiler am I using now":

Intel P-III 1.26Ghz (Gentoo 2006.1)

                Std crc      Slice-8 crc
8192 bytes     12.967997    28.363876   (gcc 4.1.1)
8192 bytes     12.956765    13.978495   (gcc 3.4.6)

So Gentoo using gcc 3.4.6 looks like FreeBSD using 3.4.6, so the std vs slice-8 performance seems to be all about compiler version. Now the regression on FreeBSD for the std case might be (as Kenneth pointed out) due to 4.1.1 being built by 3.4.6... but I guess it is just a nit at this point.

Cheers

Mark
On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
>>> No can do --- we rely on the checksums to be able to tell when we've hit
>>> the end of WAL during replay.
>>
>> No we don't: Zero length records are the trigger for EOF.
>
> Only if the file happens to be all-zero already, which is not the normal
> operating state (see WAL-file recycling). Otherwise you have to be able
> to detect an invalid record.

OK.

> There are actually three checks used to detect end of WAL: zero record
> length, invalid checksum, and incorrect back-pointer. Zero length is
> the first and cleanest-looking test, but AFAICS we have to have both of
> the others to avoid obvious failure modes.

Well, I understand the need for the zero-length test and the incorrect back pointer (which also incidentally tests that the current record was not left over from previous use of the xlog file). The checksum protects from torn pages and disk errors. If you have full_page_writes set then you already believe yourself safe from torn pages and your disks could also already be CRC-checking the data. So you don't *need* the checksum in those cases. If we really think we need it we could put the xlprev pointer as the *last* field on the xlrec, just to make doubly sure - having a trailer as well as a header.

CRC-checked disks are actually the industry norm and have been for around 5 years. ANSI SCSI Parallel Interface 3 (SPI-3) (UltraSCSI 160) defines the use of CRC and this is available as standard from all key manufacturers (CRC-32 is the required level). So if you are using modern SCSI, SATA or SAS technologies you'll be just fine. (Those checks are a *requirement* of the underlying technology because of the error rates when the bus speeds are so high.) I checked this out in a conversation with the fine people at Seagate, who also publish a variety of technical manuals/details on this, e.g.

http://www.maxtor.com/_files/maxtor/en_us/documentation/white_papers_technical/wp_ultra320.pdf

...You'll see that CRC is a Mandatory Feature of ANSI SPI-4 now.

http://www.maxtor.com/_files/maxtor/en_us/documentation/white_papers_technical/sas_link_layer.pdf

So, I'd like the *option* to turn our CRC off, please, somehow.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote: >> There are actually three checks used to detect end of WAL: zero record >> length, invalid checksum, and incorrect back-pointer. Zero length is >> the first and cleanest-looking test, but AFAICS we have to have both of >> the others to avoid obvious failure modes. > The checksum protects from torn pages and disk errors. If you have > full_page_writes set then you already believe yourself safe from torn > pages and your disks could also already be CRC-checking the data. No, because unlike tuples, WAL records can and do cross page boundaries. Unless you're prepared to posit that your system never crashes at all, you have to be able to detect the case where you've got a good front half of a WAL record and non-matching data in the next page. The two halves could be written out in widely separated write()s, indeed might never have been simultaneously resident in the WAL buffers at all. > CRC-checked disks are actually the industry norm and have been for > around 5 years. Huh? Disks have ALWAYS had CRCs, and this is in any case utterly irrelevant to the database-crash risk. regards, tom lane
On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
>>> There are actually three checks used to detect end of WAL: zero record
>>> length, invalid checksum, and incorrect back-pointer. Zero length is
>>> the first and cleanest-looking test, but AFAICS we have to have both of
>>> the others to avoid obvious failure modes.
>>
>> The checksum protects from torn pages and disk errors. If you have
>> full_page_writes set then you already believe yourself safe from torn
>> pages and your disks could also already be CRC-checking the data.
>
> No, because unlike tuples, WAL records can and do cross page boundaries.

But not that often, with full_page_writes = off. So we could get away with just CRC checking the page-spanning ones and mark the records to show whether they have been CRC checked or not and need to be rechecked at recovery time. That would reduce the CRC overhead to about 1-5% of what it is now (as an option).

Just a thought: Would there be benefit in not allowing page-spanning WAL records, when they are small? That would tend to reduce the number of WAL writes, even if it did cause some space wastage on disk. That would reduce the number of same block re-writes and might improve the sequential behaviour of WAL access. We never needed to think about this while full_page_writes=on was the only option.

>> CRC-checked disks are actually the industry norm and have been for
>> around 5 years.
>
> Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
> irrelevant to the database-crash risk.

According to the man from Seagate...

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
> On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
>> No, because unlike tuples, WAL records can and do cross page boundaries.
>
> But not that often, with full_page_writes = off. So we could get away
> with just CRC checking the page-spanning ones and mark the records to
> show whether they have been CRC checked or not and need to be rechecked
> at recovery time. That would reduce the CRC overhead to about 1-5% of
> what it is now (as an option).

WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a guarantee in here that the 8 Kbytes worth of data will be written as sequential writes, nor that the 8 Kbytes of data will necessarily finish.

If the operating system uses 8 Kbyte pages, or the RAID system uses 8 Kbytes or larger chunks, and they guarantee sequential writes, perhaps it is ok. Still, if the power goes out after writing the first 512 bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it might get better or worse, depending on the RAID configuration.

I'm almost wondering whether the three numbers are enough. I'm too busy to sketch it all down and predict failure points... :-)

> Just a thought: Would there be benefit in not allowing page-spanning WAL
> records, when they are small? That would tend to reduce the number of
> WAL writes, even if it did cause some space wastage on disk. That would
> reduce the number of same block re-writes and might improve the
> sequential behaviour of WAL access. We never needed to think about this
> while full_page_writes=on was the only option.

Probably. Might not be much though.

>>> CRC-checked disks are actually the industry norm and have been for
>>> around 5 years.
>>
>> Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
>> irrelevant to the database-crash risk.
>
> According to the man from Seagate...

Once upon a time when bits were stored the size of smarties (exaggeration).... Back in those days (before my time), they liked to use things like parity. I didn't read the page, but perhaps there is some confusion between CRC and error correction codes. Obviously, technologies were introduced over time. I don't remember ever having a hard disk that didn't have some sort of error detection. The 5 year claim seems decades too short unless they are talking about a newer technology. Even the old 5.25" DOS floppies seemed to be able to detect errors rather than return invalid corrupted bits.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Ottawa, Ontario, Canada
http://mark.mielke.cc/
mark@mark.mielke.cc writes:
> The 5 year claim seems
> decades too short unless they are talking about a newer technology.

I think what Simon is on about is CRCs being routinely used on the cable between the disk drive and the CPU. When I was involved in this stuff you usually only got a parity bit on each byte transferred. CRCs for all disk command/result messages would definitely help close a risk area --- but that's only one risk of many.

            regards, tom lane
On Tue, 2006-10-24 at 10:18 -0400, mark@mark.mielke.cc wrote:
> On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
>> But not that often, with full_page_writes = off. So we could get away
>> with just CRC checking the page-spanning ones and mark the records to
>> show whether they have been CRC checked or not and need to be rechecked
>> at recovery time. That would reduce the CRC overhead to about 1-5% of
>> what it is now (as an option).
>
> WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
> guarantee in here that the 8 Kbytes worth of data will be written as
> sequential writes, nor that the 8 Kbytes of data will necessarily
> finish.
>
> If the operating system uses 8 Kbyte pages, or the RAID system uses 8
> Kbytes or larger chunks, and they guarantee sequential writes, perhaps
> it is ok. Still, if the power goes out after writing the first 512
> bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
> might get better or worse, depending on the RAID configuration.

That is the torn-page problem. If your system doesn't already protect you against this you have no business turning off full_page_writes, which was one of my starting assumptions.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Tue, Oct 24, 2006 at 05:05:58PM +0100, Simon Riggs wrote:
> On Tue, 2006-10-24 at 10:18 -0400, mark@mark.mielke.cc wrote:
>> If the operating system uses 8 Kbyte pages, or the RAID system uses 8
>> Kbytes or larger chunks, and they guarantee sequential writes, perhaps
>> it is ok. Still, if the power goes out after writing the first 512
>> bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
>> might get better or worse, depending on the RAID configuration.
>
> That is the torn-page problem. If your system doesn't already protect
> you against this you have no business turning off full_page_writes,
> which was one of my starting assumptions.

I wasn't aware that a system could protect against this. :-)

I write 8 Kbytes - how can I guarantee that the underlying disk writes all 8 Kbytes before it loses power? And why isn't the CRC a valid means of dealing with this? :-)

If I'm wrong on one of these assumptions, I'm open to being educated. My opinion as of a few seconds ago, is that a write to a single disk sector is safe, but that a write that extends across several sectors is not.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Ottawa, Ontario, Canada
http://mark.mielke.cc/
On Tue, 2006-10-24 at 12:47 -0400, mark@mark.mielke.cc wrote:
> On Tue, Oct 24, 2006 at 05:05:58PM +0100, Simon Riggs wrote:
>> That is the torn-page problem. If your system doesn't already protect
>> you against this you have no business turning off full_page_writes,
>> which was one of my starting assumptions.
>
> I wasn't aware that a system could protect against this. :-)
>
> I write 8 Kbytes - how can I guarantee that the underlying disk writes
> all 8 Kbytes before it loses power? And why isn't the CRC a valid means
> of dealing with this? :-)
>
> If I'm wrong on one of these assumptions, I'm open to being educated.
> My opinion as of a few seconds ago, is that a write to a single disk
> sector is safe, but that a write that extends across several sectors
> is not.

http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html

full_page_writes = off

I'm very happy to learn more myself...

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Tue, 2006-10-24 at 14:52 +0100, Simon Riggs wrote:
> On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
>> No, because unlike tuples, WAL records can and do cross page boundaries.
>
> But not that often, with full_page_writes = off. So we could get away
> with just CRC checking the page-spanning ones and mark the records to
> show whether they have been CRC checked or not and need to be rechecked
> at recovery time. That would reduce the CRC overhead to about 1-5% of
> what it is now (as an option).

Looking further, I see that the xlog page header already contains xlp_pageaddr, which is an XLogRecPtr. So an xlrec that tried to span multiple pages yet failed in between would easily show up as a failure in ValidXLOGHeader(), even before the CRC check. [The XLogRecPtr contains both the offset within the file and a unique identification of the WAL file, so any data left over from previous uses of that data file will be easily recognised as such.] So we don't even need to CRC check the page-spanning ones either.

So it all comes down to: do you trust your hardware? I accept that some hardware/OS combinations will give you high risk, others will reduce it considerably. All I'm looking to do is to pass on the savings for people that are confident they have invested wisely in hardware.

Does anybody have a reasonable objection to introducing an option for people that are comfortable they are making the correct choice?

wal_checksum = off (or other suggested naming...)

I don't want to take blind risks, so shoot me down, please, if I err. I'm happy either way: either we speed up, or we're safer not to.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
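Schematically, the header check Simon refers to looks something like this (a sketch with simplified names; xlp_pageaddr is a real field of the xlog page header, but the function itself is illustrative, not the actual ValidXLOGHeader() code):

    /* Sketch only; assumes PostgreSQL's xlog headers for the types. */
    static bool
    xlog_page_looks_valid(XLogPageHeader hdr, XLogRecPtr expected_addr)
    {
        /*
         * A page left over from an earlier use of a recycled segment, or a
         * page whose write never completed, carries the wrong address and
         * fails this test before any CRC is computed.
         */
        return XLByteEQ(hdr->xlp_pageaddr, expected_addr);
    }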
mark@mark.mielke.cc wrote:
> If I'm wrong on one of these assumptions, I'm open to being educated.
> My opinion as of a few seconds ago, is that a write to a single disk
> sector is safe, but that a write that extends across several sectors
> is not.

Unless it's fsync'ed, which is what we do at CHECKPOINT. Keep in mind that we save full page images on WAL the first time we touch the page after a checkpoint. This means that if a partial write occurred, we will restore it from WAL. So it's not safe in general, but it is safe in Postgres.

--
Alvaro Herrera   http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 10/24/06, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:
> I wasn't aware that a system could protect against this. :-)
>
> I write 8 Kbytes - how can I guarantee that the underlying disk writes
> all 8 Kbytes before it loses power? And why isn't the CRC a valid means
> of dealing with this? :-)
[snip]

A file system with an appropriate transaction method could do this. In *theory* reiser4 write()s are atomic. No one has verified, however, that there is no torn-page risk introduced in some other part of the kernel.

I'm not aware of any other system which can guarantee the atomicity of 8k writes.
"Gregory Maxwell" <gmaxwell@gmail.com> writes: > I'm not aware of any other system which can guaranteed the atomicity > of 8k writes. The reasoning for supporting full_page_writes = off is that if you have a stable kernel and suitable backup power, battery backed write cache, etc, your risk of a partially completed write() may be low enough to be acceptable. Obviously there are no 100.00% guarantees, but that's what you keep backups for ... Simon is essentially arguing that if we are willing to assume no incomplete write() we may as well assume it for WAL too. This seems to me to be raising the risk significantly, but I admit that I can't put my finger on why exactly. One point not directly related to crash safety is whether CRC checking isn't still going to be a good idea when PITR is enabled. Archived WAL files are going to have been through a great deal more transferring than ones merely being used to recover from a crash. regards, tom lane
Sorry for getting into the conversation so late... It was a long weekend in India.
On 10/23/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I didn't particularly trust the timing calculations in your benchmark
program,
Any particular reason? (why and what did you doubt in it?).
I designed the prog. to be flexible enough to test different-sized blocks (to measure the cost of single/fewer INIT/COMP/FIN iterations) and different-sized lists of data (to control the number of iterations). Please share your wisdom.
When I first saw your results, I had a strong feeling that function-call overhead was working against SB8. Jeremy's trials with disabling loop optimizations, and his subsequent success, also pointed to this possibility.
So, I have taken your tests and converted the SB8 function calls into macros. And the results are (please note that crc = 0 is explained later):
std_8192_noprintcrc.out
crc = 0, bufsize = 8192, loops = 1000000, elapsed = 8.471994
sb8_8192_noprintcrc.out
crc = 0, bufsize = 8192, loops = 1000000, elapsed = 0.000006
std_8192_printcrc.out
crc = 8228BB0E, bufsize = 8192, loops = 1000000, elapsed = 32.490704
sb8_8192_printcrc.out
crc = 7E67A22A, bufsize = 8192, loops = 1000000, elapsed = 22.349156
std_64_noprintcrc.out
crc = 0, bufsize = 64, loops = 1000000, elapsed = 0.151354
sb8_64_noprintcrc.out
crc = 0, bufsize = 64, loops = 1000000, elapsed = 0.000005
std_64_printcrc.out
crc = 9C9FBE2E, bufsize = 64, loops = 1000000, elapsed = 0.559315
sb8_64_printcrc.out
crc = F70BC6AE, bufsize = 64, loops = 1000000, elapsed = 0.357382
The result names are in the format: <algo_type>_<test_size>_<was_mycrc_referenced_in_printf>.out
crc = 0 in the result means that the mycrc variable was not referenced anywhere after the for-loop. As can be seen, if mycrc is not referenced in the printf (that is, its usage is limited to the inside of the 'for' loop), then GCC (4.1) seems to optimize the loop heavily. In the case of SB8, if mycrc is not referenced later, GCC seems to have removed the loop entirely!
The only difference between the <x>_noprintcrc and the <x>_printcrc tests was that in the printf() call, the first parameter after the format string was either a zero or the mycrc variable, respectively.
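[To illustrate the effect: a minimal sketch, where crc_update is a toy stand-in rather than the code from the attachment, of how a result that is never read lets the compiler delete the whole timing loop.]

#include <stdio.h>
#include <stdint.h>

/* Toy stand-in for one CRC pass; not a real CRC, just enough work
 * for the optimizer to reason about. */
static uint32_t
crc_update(uint32_t crc, const unsigned char *buf, int len)
{
    for (int i = 0; i < len; i++)
        crc = (crc << 1) ^ buf[i];
    return crc;
}

void
benchmark(const unsigned char *buf, int len)
{
    uint32_t mycrc = 0;

    /* If mycrc is never read after this loop, the compiler can prove
     * the loop has no observable effect and delete it wholesale,
     * which is how timings like "elapsed = 0.000006" come about. */
    for (int i = 0; i < 1000000; i++)
        mycrc = crc_update(mycrc, buf, len);

    /* Referencing mycrc here forces the work to actually happen ... */
    printf("crc = %X, loops = %d\n", mycrc, 1000000);

    /* ... whereas printing a literal zero leaves the loop dead code:
     *     printf("crc = %X, loops = %d\n", 0, 1000000);
     */
}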
I am somewhat apprehensive that I might have made a mistake while converting the function calls to macros, though I have not been able to find one thus far. Please check its validity against the function-call version.
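[For reference, a sketch of the kind of function-to-macro conversion in question, using a simplified single-table CRC step rather than the real SB8 INIT/COMP/FIN routines.]

#include <stdint.h>

/* Single-byte CRC step as a function (simplified; the real SB8 code
 * consumes 8 bytes per step from multiple tables) ... */
static inline uint32_t
crc_byte_func(uint32_t crc, unsigned char b, const uint32_t *table)
{
    return (crc >> 8) ^ table[(crc ^ b) & 0xFF];
}

/* ... and the same step as a macro that updates crc in place.  All
 * arguments are fully parenthesized, but crc appears twice, so callers
 * must pass a plain variable there, never an expression with side
 * effects.  Done correctly, the two forms compute identical results;
 * the macro only removes call overhead the compiler failed to inline
 * away. */
#define CRC_BYTE_MACRO(crc, b, table) \
    ((crc) = ((crc) >> 8) ^ (table)[((crc) ^ (b)) & 0xFF])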
If there's no mistake, then I think SB8 is back in the performance game. These results were obtained with gcc 4.1 on FC5, running on an Intel Pentium M 1.86 GHz, with the OS started and running in runlevel 3.
Please dump the .c and .h files from the attachment on top of Tom's package, and test as before.
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Attachment
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Simon is essentially arguing that if we are willing to assume no
> incomplete write() we may as well assume it for WAL too. This seems
> to me to be raising the risk significantly, but I admit that I can't
> put my finger on why exactly.

Actually I think we can deal with torn pages in the WAL more easily than
in database files anyway. In database files we need to get the entire
page correct one way or the other, so we need full_page_writes in order
to deal with them properly.

In the WAL we just need to be able to detect torn pages and stop reading
WAL at that point. That's easier and doesn't really need a CRC. We could
just adopt the Sybase strategy of storing a unique id number every 512
bytes throughout the WAL page. If those numbers don't match then we have
a torn page; the system crashed at that point and we should stop reading
WAL pages.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
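[A minimal sketch of that scheme, assuming 8 KB WAL pages and 512-byte hardware sectors; the names are illustrative, not from the PostgreSQL source.]

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define WAL_PAGE_SIZE    8192
#define SECTOR_SIZE      512
#define SECTORS_PER_PAGE (WAL_PAGE_SIZE / SECTOR_SIZE)   /* 16 */

/* Before writing, stamp the same id into a reserved slot at the start
 * of each 512-byte sector of the page; a fresh id is used per write. */
static void
stamp_page(unsigned char *page, uint32_t write_id)
{
    for (int s = 0; s < SECTORS_PER_PAGE; s++)
        memcpy(page + s * SECTOR_SIZE, &write_id, sizeof(write_id));
}

/* On recovery, a torn page shows up as disagreeing stamps: some
 * sectors carry the new id, some the id of an older write. */
static bool
page_is_torn(const unsigned char *page)
{
    uint32_t first, cur;

    memcpy(&first, page, sizeof(first));
    for (int s = 1; s < SECTORS_PER_PAGE; s++)
    {
        memcpy(&cur, page + s * SECTOR_SIZE, sizeof(cur));
        if (cur != first)
            return true;        /* stop reading WAL here */
    }
    return false;
}

[In a real implementation the record payload would have to be laid out around the reserved stamp slots, which is part of the cost of the scheme.]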
On Tue, 2006-10-24 at 14:07 -0400, Tom Lane wrote:
> "Gregory Maxwell" <gmaxwell@gmail.com> writes:
> > I'm not aware of any other system which can guarantee the atomicity
> > of 8k writes.
>
> The reasoning for supporting full_page_writes = off is that if you have
> a stable kernel and suitable backup power, battery-backed write cache,
> etc., your risk of a partially completed write() may be low enough to
> be acceptable. Obviously there are no 100.00% guarantees, but that's
> what you keep backups for ...
>
> Simon is essentially arguing that if we are willing to assume no
> incomplete write() we may as well assume it for WAL too. This seems
> to me to be raising the risk significantly, but I admit that I can't
> put my finger on why exactly.

I agree about the significant additional risk, hence the additional
parameter. I'll do some internal testing to see what the risk-reward is.
If that seems worthwhile, then I'll post the patch for general
testing/comment.

(Incidentally, having GUCs that depend on other GUCs is bad news, since
they are set alphabetically. I'd want to allow wal_checksum = off only
when full_page_writes = off, which will work, but only because W comes
after F and for no other reason. A generic solution for dependent GUCs
would be great...)

> One point not directly related to crash safety is whether CRC checking
> isn't still going to be a good idea when PITR is enabled. Archived WAL
> files are going to have been through a great deal more transferring than
> ones merely being used to recover from a crash.

Agreed. Both disks and tapes/other mechanisms must be known CRC-safe
before this idea would be worth using in production. Many enterprises do
already think they have bomb-proof kit, so we may as well support them
in that belief.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > On 10/23/06, Tom Lane <tgl@sss.pgh.pa.us > wrote: >> I didn't particularly trust the timing calculations in your benchmark >> program, > Any particular reason? (why and what did you doubt in it?). Well, the specific thing that set off my bogometer was #define TV_DIFF_MILLI(tv1, tv2) ((tv2.tv_sec*1000+((tv2.tv_usec)/1000))-(tv1.tv_sec*1000+((tv1.tv_usec)/1000))) which is going to have overflow problems on any platform where tv_sec isn't a 64-bit type (which is still all of 'em AFAIK). But more generally, your test is timing a CRC across 100 4Kb segments, which isn't representative of PG's usage of CRCs. I don't think there are any XLogWrite calls that have more than about 5 segments, and in most cases those segments contain a few dozen bytes not a few K. So you need to be looking at much shorter loop runs. The test case I proposed uses timing code that I trusted (borrowed from long-established coding in postgres.c), and tests loop lengths that are somewhat representative for PG, but it is still biased in favor of slice8 because it repeats the CRC calculations consecutively without any other activity --- presumably this fact creates a bias for a method that needs more L2 cache space over one that doesn't need so much. I'd have tried harder to make an unbiased test case if this version had showed slice8 as competitive, but so far it seems that on a majority of current CPUs and compilers it's not competitive. regards, tom lane
On Tue, 2006-10-24 at 15:42 -0400, Gregory Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Simon is essentially arguing that if we are willing to assume no
> > incomplete write() we may as well assume it for WAL too. This seems
> > to me to be raising the risk significantly, but I admit that I can't
> > put my finger on why exactly.
>
> Actually I think we can deal with torn pages in the WAL more easily
> than in database files anyway. In database files we need to get the
> entire page correct one way or the other, so we need full_page_writes
> in order to deal with them properly.
>
> In the WAL we just need to be able to detect torn pages and stop
> reading WAL at that point. That's easier and doesn't really need a
> CRC. We could just adopt the Sybase strategy of storing a unique id
> number every 512 bytes throughout the WAL page. If those numbers don't
> match then we have a torn page; the system crashed at that point and
> we should stop reading WAL pages.

Right, but I'm only suggesting using this if you're safe from torn pages
anyhow. I'm not suggesting anyone set wal_checksum = off when
full_page_writes = on.

I do think you're right: we're much less exposed by a torn page in WAL
than we are for the stable database.

I've looked into this in more depth following your suggestion: I think
it seems straightforward to move the xl_prev field from being a header
to a trailer. That way, when we do the test on the back-pointer, we will
be assured that there is no torn page affecting the remainder of the
xlrec. That would make it safer with wal_checksum = off.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > I've looked into this in more depth following your suggestion: I think > it seems straightforward to move the xl_prev field from being a header > to a trailer. That way when we do the test on the back pointer we will > be assured that there is no torn page effecting the remainder of the > xlrec. That would make it safer with wal_checksum = off. Hm. I think in practice this may actually help reduce the exposure to torn pages. However in theory there's no particular reason to think the blocks will be written out in physical order. The kernel may sync its buffers in some order dictated by its in-memory data structure and may end up coming across the second half of the 8kb page before the first half. It may even lie earlier on disk than the first half if the filesystem started a new extent at that point. If they were 4kb pages there would be fewer ways it could be written out of order, but even then the hard drive could find a bad block and remap it. I'm not sure what level of granularity drives remap at, it may be less than 4kb. To eliminate the need for the CRC in the WAL for everyone and still be safe from torn pages I think you have to have something like xl_prev repeated every 512b throughout the page. But if this is only an option for systems that don't expect to suffer from torn pages then sure, putting it in a footer seems like a good way to reduce the exposure somewhat. Putting it in both a header *and* a footer might be even better. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Mon, Oct 23, 2006 at 05:23:27PM -0400, Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
> > Right - I think the regression is caused by libc and kernel being built
> > with gcc 3.4.6 and the test program being built with gcc 4.1.2.
>
> Why do you think that? The performance of the CRC loop shouldn't depend
> at all on either libc or the kernel, because they're not invoked inside
> the loop.

I can believe that not re-building GCC 4.1.x with the 4.1.x compiler
could result in it not taking full advantage of new features and
functions.

Ken
>> In the WAL we just need to be able to detect torn pages and stop
>> reading WAL at that point. That's easier and doesn't really need a
>> CRC. We could just adopt the Sybase strategy of storing a unique id
>> number every 512 bytes throughout the WAL page. If those numbers don't
>> match then we have a torn page; the system crashed at that point and
>> we should stop reading WAL pages.

> I've looked into this in more depth following your
> suggestion: I think it seems straightforward to move the
> xl_prev field from being a header to a trailer. That way when
> we do the test on the back-pointer we will be assured that
> there is no torn page affecting the remainder of the xlrec.
> That would make it safer with wal_checksum = off.

I do not think we can assume any order in which a block is written to
disk.

I think all this can only be used on OS and hardware combinations that
can guarantee that what is written by one IO call (typically 8k) from pg
is safe. Those combinations do exist, so I think we want the switch.

Putting xl_prev at the end helps only for direct IO WAL sync methods;
otherwise we would need it on every page.

Andreas
On Fri, 2006-10-27 at 10:54 +0200, Zeugswetter Andreas ADI SD wrote:
> >> In the WAL we just need to be able to detect torn pages and stop
> >> reading WAL at that point. That's easier and doesn't really need a
> >> CRC. We could just adopt the Sybase strategy of storing a unique id
> >> number every 512 bytes throughout the WAL page. If those numbers
> >> don't match then we have a torn page; the system crashed at that
> >> point and we should stop reading WAL pages.
>
> > I've looked into this in more depth following your
> > suggestion: I think it seems straightforward to move the
> > xl_prev field from being a header to a trailer. That way when
> > we do the test on the back-pointer we will be assured that
> > there is no torn page affecting the remainder of the xlrec.
> > That would make it safer with wal_checksum = off.
>
> I do not think we can assume any order in which a block is written to
> disk.
>
> I think all this can only be used on OS and hardware combinations that
> can guarantee that what is written by one IO call (typically 8k) from
> pg is safe. Those combinations do exist, so I think we want the switch.

OK, good. ... but from the various comments I'll not bother moving
xl_prev to be a trailer; I'll just leave it where it is now.

> Putting xl_prev at the end helps only for direct IO WAL sync methods;
> otherwise we would need it on every page.

[There is already an XLogRecPtr on each 8k page.]

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Fri, Oct 27, 2006 at 10:11:08AM +0100, Simon Riggs wrote:
> > Putting xl_prev at the end helps only for direct IO WAL sync methods;
> > otherwise we would need it on every page.
>
> [There is already an XLogRecPtr on each 8k page.]

Given that the hardware sector size is still 512 bytes, should there be
a way of detecting a missing 512-byte block in the middle of an 8K
block? The idea of simply writing a serial counter every 512 bytes seems
to be a good way to handle that...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
> > > Putting xl_prev at the end helps only for direct IO WAL sync
> > > methods; otherwise we would need it on every page.
> >
> > [There is already an XLogRecPtr on each 8k page.]
>
> Given that the hardware sector size is still 512 bytes, should there
> be a way of detecting a missing 512-byte block in the middle of an 8K
> block? The idea of simply writing a serial counter every 512 bytes
> seems to be a good way to handle that...

No, we have CRC for that. You are not supposed to turn it off when there
is a chance that an 8k block might not be whole.

Andreas