Thread: Cost of XLogInsert CRC calculations
I was profiling a case involving UPDATEs into a table with too many indexes (brought to mind by mysql's sql-bench, about which more later) and got this rather surprising result for routines costing more than 1% of the total runtime:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 64.03     86.20    86.20   133608     0.00     0.00  XLogInsert
  3.50     90.91     4.71  2484787     0.00     0.00  _bt_compare
  2.92     94.84     3.93   839893     0.00     0.00  hash_search
  2.77     98.57     3.73  1875815     0.00     0.00  LWLockAcquire
  1.89    101.12     2.55  1887972     0.00     0.00  LWLockRelease
  1.27    102.83     1.71   125234     0.00     0.00  _bt_getroot
  1.01    104.19     1.36   403342     0.00     0.00  PinBuffer
  1.00    105.54     1.35   840002     0.00     0.00  hash_any

I suppose that the bulk of the CPU cycles being attributed to XLogInsert are going into the inlined CRC calculations. Maybe we need to think twice about the cost/benefit ratio of using 64-bit CRCs to protect xlog records that are often only a few dozen bytes.

regards, tom lane
On Sun, 6 Mar 2005, Tom Lane wrote:
> I suppose that the bulk of the CPU cycles being attributed to XLogInsert
> are going into the inlined CRC calculations. Maybe we need to think
> twice about the cost/benefit ratio of using 64-bit CRCs to protect xlog
> records that are often only a few dozen bytes.

Isn't the CRC quite important on recovery to recognize where the last valid log record is?

Are there any better implementations of CRC-64? Would using a different polynomial help? Would it help to do the CRC calculation in a more wholesale fashion in XLogWrite? How about switching to CRC-32 or even CRC-16?

I searched the archives for the reason CRC-64 was chosen in the first place. It seems that the difference in computation time was not considered to be significant, and there were 8 bytes available in the record header anyway.

Just some thoughts...

- Heikki
On Sun, 2005-03-06 at 00:17 -0500, Tom Lane wrote: > I was profiling a case involving UPDATEs into a table with too many > indexes (brought to mind by mysql's sql-bench, about which more later) > and got this rather surprising result for routines costing more than > 1% of the total runtime: > > Each sample counts as 0.01 seconds. > % cumulative self self total > time seconds seconds calls s/call s/call name > 64.03 86.20 86.20 133608 0.00 0.00 XLogInsert > 3.50 90.91 4.71 2484787 0.00 0.00 _bt_compare > 2.92 94.84 3.93 839893 0.00 0.00 hash_search > 2.77 98.57 3.73 1875815 0.00 0.00 LWLockAcquire > 1.89 101.12 2.55 1887972 0.00 0.00 LWLockRelease > 1.27 102.83 1.71 125234 0.00 0.00 _bt_getroot > 1.01 104.19 1.36 403342 0.00 0.00 PinBuffer > 1.00 105.54 1.35 840002 0.00 0.00 hash_any > > I suppose that the bulk of the CPU cycles being attributed to XLogInsert > are going into the inlined CRC calculations. Maybe we need to think > twice about the cost/benefit ratio of using 64-bit CRCs to protect xlog > records that are often only a few dozen bytes. Yes, in recent performance tests sponsored by Unisys, this result was also very clear. In those tests we used Intel VTune to identify the precise lines of code soaking up the cycles...it was the CRC checks. More results should be available from the Unisys testing within a few days. I had assumed that the majority of the cost of CRC checking was as a result of the need to log complete blocks, rather than the rather small xlog records themselves? Best Regards, Simon Riggs
Hi Tom, > I was profiling a case involving UPDATEs into a table with too many indexes (brought to > mind by mysql's sql-bench, about which more later) and got this rather surprising result > for routines costing more than 1% of the total runtime: (cut) > I suppose that the bulk of the CPU cycles being attributed to XLogInsert are going into > the inlined CRC calculations. Maybe we need to think twice about the cost/benefit ratio > of using 64-bit CRCs to protect xlog records that are often only a few dozen bytes. Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm sure there are some error rates somewhere dependent upon the polynomial and the types of error detected.... Try the following link towards the bottom: http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates vs. CRC size. Kind regards, Mark. ------------------------ WebBased Ltd South West Technology Centre Tamar Science Park Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem > days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm > sure there are some error rates somewhere dependent upon the polynomial and > the types of error detected.... Try the following link towards the bottom: > http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates > vs. CRC size. When the CRC size was decided, I recall someone arguing that it would really make a difference to have 1-in-2^64 chance of failure rather than 1-in-2^32. I was dubious about this at the time, but didn't have any evidence showing that we shouldn't go for 64. I suppose we ought to try the same example with a 32-bit CRC and see how much it helps. regards, tom lane
Tom Lane wrote: > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > >>Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem >>days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm >>sure there are some error rates somewhere dependent upon the polynomial and >>the types of error detected.... Try the following link towards the bottom: >>http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates >>vs. CRC size. > > > When the CRC size was decided, I recall someone arguing that it would > really make a difference to have 1-in-2^64 chance of failure rather than > 1-in-2^32. I was dubious about this at the time, but didn't have any > evidence showing that we shouldn't go for 64. I suppose we ought to try > the same example with a 32-bit CRC and see how much it helps. Continuing this why not a 16-bit then ? Regards Gaetano Mendola
On Mon, 2005-03-07 at 09:39 -0500, Tom Lane wrote:
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem
> > days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm
> > sure there are some error rates somewhere dependent upon the polynomial and
> > the types of error detected.... Try the following link towards the bottom:
> > http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates
> > vs. CRC size.
>
> When the CRC size was decided, I recall someone arguing that it would
> really make a difference to have 1-in-2^64 chance of failure rather than
> 1-in-2^32. I was dubious about this at the time, but didn't have any
> evidence showing that we shouldn't go for 64. I suppose we ought to try
> the same example with a 32-bit CRC and see how much it helps.

I think some of the additional run-time may be coming from processor stalls associated with some of the constants used in the CRC checks. I'll come back with more info on that later.

Well, we're using the CRC in 3 separate places:
(1) for xlog records
(2) for complete blocks copied to xlog
(3) for control files

For (1), records are so short that probably CRC16 would be sufficient without increasing the error rate noticeably.

I think I'd like to keep (3) at CRC64... it's just too important. Plus that's slightly less code to change.

My money is on (2) being the source of most of that run-time anyway, since when we enclose a whole block it takes a lot longer to CRC64 all BLCKSZ bytes than it would do to CRC a single record in (1). But of course, longer stretches of data need better error detection rates.

If Ethernet is using CRC32, it seems somewhat strange to use anything higher than that, seeing as we're very likely to be sending xlog files across the net anyway. Packet size is mostly comparable to BLCKSZ, isn't it? So, yes, CRC32 seems more reasonable.
One of the things I was thinking about was whether we could use up those cycles more effectively. If we were to include a compression routine before we calculated the CRC that would - reduce the size of the blocks to be written, hence reduce size of xlog - reduce the following CRC calculation I was thinking about using a simple run-length encoding to massively shrink half-empty blocks with lots of zero padding, but we've already got code to LZW the data down also. Best Regards, Simon Riggs
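Simon's run-length idea for zero-padded blocks could look something like the sketch below. The encoding is made up purely for illustration (a 0x00 marker byte followed by a one-byte run length; nothing like this exists in the PostgreSQL tree), but it shows why half-empty pages would shrink dramatically before the CRC is taken:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zero-run RLE for half-empty pages: non-zero bytes are
 * copied through as literals; a run of zeros becomes a 0x00 marker
 * followed by a one-byte run length (at most 255 per marker).  Since
 * a literal byte is never 0x00, the stream decodes unambiguously. */
static size_t
rle_zero_encode(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < len; )
    {
        if (src[i] == 0)
        {
            size_t run = 0;

            while (i < len && src[i] == 0 && run < 255)
            {
                run++;
                i++;
            }
            dst[out++] = 0x00;                  /* zero-run marker */
            dst[out++] = (unsigned char) run;   /* run length */
        }
        else
            dst[out++] = src[i++];              /* literal byte */
    }
    return out;                                  /* compressed length */
}
```

An 8 kB block that is half zero padding would come out at roughly half its size, so the subsequent CRC cost falls proportionally.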
Simon Riggs <simon@2ndquadrant.com> writes: > Well, we're using the CRC in 3 separate places... > (1) for xlog records > (2) for complete blocks copied to xlog > (3) for control files > For (1), records are so short that probably CRC16 would be sufficient > without increasing the error rate noticeably. > I think I'd like to keep (3) at CRC64...its just too important. Plus > thats slightly less code to change. The control files are so short that CRC16 would be plenty. > My money is on (2) being the source of most of that run-time anyway, Undoubtedly, so there's not going to be much win from micro-optimization by having several different CRC functions. I would go for CRC32 across the board, myself. regards, tom lane
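For concreteness, a 32-bit table-driven CRC of the kind being proposed looks roughly like this. It uses the reflected CRC-32 polynomial 0xEDB88320 (the Ethernet/zlib one) as an illustrative sketch; it is not the actual pg_crc.h code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];

/* Precompute the CRC of every possible input byte. */
static void
crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;

        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc32_table[i] = c;
    }
}

/* One table lookup, one shift, and one XOR per input byte. */
static uint32_t
crc32(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;

    while (len-- > 0)
        c = crc32_table[(c ^ *buf++) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}
```

After crc32_init(), feeding in the string "123456789" yields the standard CRC-32 check value 0xCBF43926, which is a handy sanity test for any reimplementation.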
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Simon Riggs <simon@2ndquadrant.com> writes:
>
> > For (1), records are so short that probably CRC16 would be sufficient
> > without increasing the error rate noticeably.
>
> The control files are so short that CRC16 would be plenty.

It's not really the size of the data that governs the reasonable size of the CRC. It's the probability of there being errors. For that you have to think about what possible causes of errors you're concerned with.

AFAIK there's no CRC check at all in tables or indexes. So it doesn't seem like bad hardware like a bad drive or RAM is what you're concerned with here. From the other post it sounded like the threat was the possibility of an interruption arriving after the transaction commit log entry is written but before the fsync has written the rest of the log. In that case it seems the probability of it occurring is independent of the size of the log entry.

Is a 1 in 64k chance of a power failure resulting in data corruption acceptable? A 1 in 4 billion chance?

You could eliminate the hole completely by using two fsyncs: fsync the transaction information once, then, once that completes, write and fsync the transaction commit log entry.

-- greg
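Greg's two-fsync ordering can be sketched against plain POSIX calls as below. This is illustrative only (the function name is made up, and xlog.c's real write path is far more involved); the point is purely the ordering guarantee:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Force the transaction's data records to disk before the commit
 * record is even written, so a crash can never leave a commit record
 * on disk that points at data which never made it out.
 * Returns 0 on success, -1 on any failure. */
static int
commit_with_two_fsyncs(int fd,
                       const char *data, size_t datalen,
                       const char *commit, size_t commitlen)
{
    if (write(fd, data, datalen) != (ssize_t) datalen)
        return -1;
    if (fsync(fd) != 0)         /* first sync: data records durable */
        return -1;
    if (write(fd, commit, commitlen) != (ssize_t) commitlen)
        return -1;
    if (fsync(fd) != 0)         /* second sync: commit record durable */
        return -1;
    return 0;
}
```

The price, of course, is two synchronous disk waits per commit instead of one, which is why the CRC-over-the-whole-record approach was attractive in the first place.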
On Mon, 2005-03-07 at 20:50 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Well, we're using the CRC in 3 separate places... > > (1) for xlog records > > (2) for complete blocks copied to xlog > > (3) for control files > > > For (1), records are so short that probably CRC16 would be sufficient > > without increasing the error rate noticeably. > > > I think I'd like to keep (3) at CRC64...its just too important. Plus > > thats slightly less code to change. > > The control files are so short that CRC16 would be plenty. > > > My money is on (2) being the source of most of that run-time anyway, > > Undoubtedly, so there's not going to be much win from micro-optimization > by having several different CRC functions. Agreed. > I would go for CRC32 across > the board, myself. Sold. Best Regards, Simon Riggs
On Mon, Mar 07, 2005 at 11:53:59PM +0000, Simon Riggs wrote:
> On Mon, 2005-03-07 at 09:39 -0500, Tom Lane wrote:
> > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
[skipped]
> Well, we're using the CRC in 3 separate places...
> (1) for xlog records
> (2) for complete blocks copied to xlog
> (3) for control files
>
> For (1), records are so short that probably CRC16 would be sufficient
> without increasing the error rate noticeably.
>
> I think I'd like to keep (3) at CRC64...its just too important. Plus
> thats slightly less code to change.
>
> My money is on (2) being the source of most of that run-time anyway,
> since when we enclose a whole block it takes a lot longer to CRC64 all
> BLCKSZ bytes than it would do to CRC a single record in (1). But of
> course, longer stretches of data need better error detection rates.

Well, if there is no need for error recovery, then what about using a simpler algorithm, like a plain checksum? Perhaps it could even be folded into one of the required memcpy calls.
> One of the things I was thinking about was whether we could use up those > cycles more effectively. If we were to include a compression routine > before we calculated the CRC that would > - reduce the size of the blocks to be written, hence reduce size of xlog > - reduce the following CRC calculation > > I was thinking about using a simple run-length encoding to massively > shrink half-empty blocks with lots of zero padding, but we've already > got code to LZW the data down also. > > Best Regards, Simon Riggs > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org Simon, I think having a compression routine in there could make real sense. We have done some major I/O testing involving compression for a large customer some time ago. We have seen that compressing / decompressing on the fly is in MOST cases much faster than uncompressed I/O (try a simple "cat file | ..." vs." zcat file.gz | ...") - the zcat version will be faster on all platforms we have tried (Linux, AIX, Sun on some SAN system, etc. ...). Also, when building up a large database within one transaction the xlog will eat a lot of storage - this can be quite annoying when you have to deal with a lot of data). Are there any technical reasons which would prevent somebody from implementing compression? Best regards, Hans -- Cybertec Geschwinde u Schoenig Schoengrabern 134, A-2020 Hollabrunn, Austria Tel: +43/660/816 40 77 www.cybertec.at, www.postgresql.at
On Fri, 2005-03-11 at 19:31 +0100, Hans-Jürgen Schönig wrote: > > One of the things I was thinking about was whether we could use up those > > cycles more effectively. If we were to include a compression routine > > before we calculated the CRC that would > > - reduce the size of the blocks to be written, hence reduce size of xlog > > - reduce the following CRC calculation > > > > I was thinking about using a simple run-length encoding to massively > > shrink half-empty blocks with lots of zero padding, but we've already > > got code to LZW the data down also. > I think having a compression routine in there could make real sense. > We have done some major I/O testing involving compression for a large > customer some time ago. We have seen that compressing / decompressing on > the fly is in MOST cases much faster than uncompressed I/O (try a simple > "cat file | ..." vs." zcat file.gz | ...") - the zcat version will be > faster on all platforms we have tried (Linux, AIX, Sun on some SAN > system, etc. ...). > Also, when building up a large database within one transaction the xlog > will eat a lot of storage - this can be quite annoying when you have to > deal with a lot of data). Agreed. > Are there any technical reasons which would prevent somebody from > implementing compression? Not currently, that I'm aware of. But the way additional blocks are added to xlog records would need to be changed to allow for variable length output. It's on my list... Best Regards, Simon Riggs
> -----Original Message----- > From: Mark Cave-Ayland [mailto:m.cave-ayland@webbased.co.uk] > Sent: 07 March 2005 11:04 > To: 'tgl@sss.pgh.pa.us' > Cc: 'pgsql-hackers@postgreSQL.org' > Subject: Re: Cost of XLogInsert CRC calculations (cut) > > I suppose that the bulk of the CPU cycles being attributed to > > XLogInsert are going into > the inlined CRC calculations. Maybe we > > need to think twice about the cost/benefit ratio > of using 64-bit > > CRCs to protect xlog records that are often only a few dozen bytes. > > Wow, a 64-bit CRC does seem excessive, especially when going > back to Zmodem days where a 50-100k file seemed to be easily > protected by a 32-bit CRC. I'm sure there are some error > rates somewhere dependent upon the polynomial and the types > of error detected.... Try the following link towards the > bottom: http://www.ee.unb.ca/tervo/ee4253/crc.htm for some > theory on detection rates vs. CRC size. Hi Tom/Simon, I was just researching some articles on compression (zlib) and I saw mention of the Adler-32 algorithm which is supposed to be slightly less accurate than an equivalent CRC calculation but significantly faster to compute. I haven't located a good paper comparing the error rates of the two different checksums, however extending the Adler-32 algorithm up to 64 bits may be a way of increasing the speed at the expense of a slight loss in the accuracy of error detection if we decided that we want to stay at 64 bits as opposed to 32. The following seems to indicate that Adler-32 is at least twice as fast as optimised CRC32: http://www.winimage.com/misc/readfile_test.htm. Kind regards, Mark. ------------------------ WebBased Ltd 17 Research Way Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
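For reference, the Adler-32 that Mark mentions is simple enough to sketch from its definition in RFC 1950: two running sums reduced modulo 65521 (the largest prime below 2^16), packed into one 32-bit word. This naive version reduces on every byte; real implementations such as zlib's defer the expensive modulo across long runs, which is where the speed advantage comes from:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u        /* largest prime below 2^16 */

static uint32_t
adler32(const unsigned char *buf, size_t len)
{
    uint32_t a = 1;             /* running sum of bytes, plus one */
    uint32_t b = 0;             /* running sum of the values of a */

    for (size_t i = 0; i < len; i++)
    {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;       /* high half: b, low half: a */
}
```

Note the structural weakness for short inputs: with only a few dozen bytes, a barely moves away from its starting value and b covers few of its 16 bits, so much of the checksum space goes unused on xlog-record-sized data.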
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > I was just researching some articles on compression (zlib) and I saw mention > of the Adler-32 algorithm which is supposed to be slightly less accurate > than an equivalent CRC calculation but significantly faster to compute. I > haven't located a good paper comparing the error rates of the two different > checksums, ... probably because there isn't one. With all due respect to the Zip guys, I doubt anyone has done anywhere near the analysis on Adler-32 that has been done on CRCs. I'd much prefer to stick with true CRC and drop it to 32 bits than go with a less-tested algorithm. Throwing more bits at the problem doesn't necessarily create a safer checksum. regards, tom lane
Tom Lane wrote: > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > > I was just researching some articles on compression (zlib) and I saw mention > > of the Adler-32 algorithm which is supposed to be slightly less accurate > > than an equivalent CRC calculation but significantly faster to compute. I > > haven't located a good paper comparing the error rates of the two different > > checksums, > > ... probably because there isn't one. With all due respect to the Zip > guys, I doubt anyone has done anywhere near the analysis on Adler-32 > that has been done on CRCs. I'd much prefer to stick with true CRC > and drop it to 32 bits than go with a less-tested algorithm. Throwing > more bits at the problem doesn't necessarily create a safer checksum. Agreed. 64-bit was overkill when we added it, and it is now shown to be a performance problem. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Tue, 2005-05-10 at 10:34 -0400, Bruce Momjian wrote: > Tom Lane wrote: > > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > > > I was just researching some articles on compression (zlib) and I saw mention > > > of the Adler-32 algorithm which is supposed to be slightly less accurate > > > than an equivalent CRC calculation but significantly faster to compute. I > > > haven't located a good paper comparing the error rates of the two different > > > checksums, > > > > ... probably because there isn't one. With all due respect to the Zip > > guys, I doubt anyone has done anywhere near the analysis on Adler-32 > > that has been done on CRCs. I'd much prefer to stick with true CRC > > and drop it to 32 bits than go with a less-tested algorithm. Throwing > > more bits at the problem doesn't necessarily create a safer checksum. > > Agreed. 64-bit was overkill when we added it, and it is now shown to be > a performance problem. Hold on... Tom has shown that there is a performance problem with the existing CRC calculation. Thanks to Mark for staying on top of that with some good ideas. The cause of the performance problem has been attributed to it being a 64-bit rather than 32-bit calculation. That is certainly part of it, but I have seen evidence that there is an Intel processor stall associated with the use of a single byte constant somewhere in the algorithm. So I'm unclear as to what extent the poor performance is attributable to either issue. That's where my experience stops so I have highlighted that for somebody with more hardware specific assembler experience to have a look at the algorithm. Fixing that, if possible, could greatly improve the performance whether or not we drop from 64 to 32 bits. My hope for outside assistance on that looks like it is not now forthcoming. My guess would be that the algorithm's use of the byte-by-byte calculation together with a bitmask of &FF is responsible. 
Perhaps varying the length of the bitmask to either &000000FF or longer may help? (see src/include/xlog_internal.h) Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > The cause of the performance problem has been attributed to it being a > 64-bit rather than 32-bit calculation. That is certainly part of it, but > I have seen evidence that there is an Intel processor stall associated > with the use of a single byte constant somewhere in the algorithm. That's awfully vague --- can't you give any more detail? I have seen XLogInsert eating significant amounts of time (up to 10% of total CPU time) on non-Intel architectures, so I think that dropping down to 32 bits is warranted in any case. But if you are correct then that might not fix the problem on Intel machines. We need more info. regards, tom lane
On Tue, 2005-05-10 at 18:22 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The cause of the performance problem has been attributed to it being a
> > 64-bit rather than 32-bit calculation. That is certainly part of it, but
> > I have seen evidence that there is an Intel processor stall associated
> > with the use of a single byte constant somewhere in the algorithm.
>
> That's awfully vague --- can't you give any more detail?
>
> I have seen XLogInsert eating significant amounts of time (up to 10% of
> total CPU time) on non-Intel architectures, so I think that dropping
> down to 32 bits is warranted in any case. But if you are correct then
> that might not fix the problem on Intel machines. We need more info.

I have seen an Intel VTune report that shows a memory stall causing high latency associated with a single assembly instruction in the compiled code of the CRC calculation. The instruction was manipulating a single byte only. I couldn't tell exactly which line of PostgreSQL code produced the assembler. This could be either a partial register stall or a memory order buffer stall (or another?). Here's a discussion of this: http://www.gamasutra.com/features/19991221/barad_pfv.htm

Sorry, but that's all I know. I will try to obtain the report, which is not in my possession.

I do *not* know with any certainty what the proportion of time lost from the CRC calc proper in an idealised CPU is against the time lost from this hardware-specific interaction. I don't know if non-Intel CPUs are affected either.

Best Regards, Simon Riggs
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: 10 May 2005 23:22 > To: Simon Riggs > Cc: Bruce Momjian; Mark Cave-Ayland (External); > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations (cut) > That's awfully vague --- can't you give any more detail? > > I have seen XLogInsert eating significant amounts of time (up > to 10% of total CPU time) on non-Intel architectures, so I > think that dropping down to 32 bits is warranted in any case. > But if you are correct then that might not fix the problem > on Intel machines. We need more info. > > regards, tom lane Hi Tom/Simon, Just for the record, I found a better analysis of Adler-32 following some links from Wikipedia. In summary, the problem with Adler-32 is that while it is only slightly less sensitive than CRC-32, it requires roughly a 1k "run-in" in order to attain full coverage of the bits (with respect to sensitivity of the input). This compares to 4 bytes of "run-in" required for CRC-32. So unless we can guarantee a minimum of 1k data per Xlog record then Adler-32 won't be suitable. See the following two links for more information: http://en.wikipedia.org/wiki/Adler-32 http://www.ietf.org/rfc/rfc3309.txt One other consideration would be that since CRC-32 calculations for Xlog records occur so often, perhaps the CRC-32 routines could be written in in-line assembler, falling back to C for unsupported processors. It would be interesting to come up with some benchmarks to see if indeed this would be faster than the current C implementation, since as the routine is called so often it could add up to a significant saving under higher loads. Kind regards, Mark. ------------------------ WebBased Ltd 17 Research Way Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
On Wed, 2005-05-11 at 13:40 +0100, Mark Cave-Ayland wrote: > So unless we can guarantee a minimum of 1k data per Xlog record then > Adler-32 won't be suitable. Most records are either 8KB or much less than 1KB. Is the benefit gained from the 8KB records worth the loss on the more frequent smaller records? > perhaps the CRC-32 routines could be written in in-line assembler If you can do this, step right up. :-) Best Regards, Simon Riggs
>>perhaps the CRC-32 routines could be written in in-line assembler > > > If you can do this, step right up. :-) > > Best Regards, Simon Riggs Surely there's an open source code floating around somewhere? Chris
> -----Original Message----- > From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au] > Sent: 12 May 2005 02:46 > To: Simon Riggs > Cc: Mark Cave-Ayland (External); 'Tom Lane'; 'Bruce Momjian'; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations > > > >>perhaps the CRC-32 routines could be written in in-line assembler > > > > > > If you can do this, step right up. :-) > > > > Best Regards, Simon Riggs > > Surely there's an open source code floating around somewhere? > > Chris Hi Chris, I've found some x86 MASM sources using Google, but nothing in GCC-style in-line assembler. It would be good to see how this compares in terms of speed with the code generated by GCC. Kind regards, Mark. ------------------------ WebBased Ltd 17 Research Way Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
On Thu, 2005-05-12 at 13:48 +0100, Mark Cave-Ayland wrote:
> > From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
> >
> > >>perhaps the CRC-32 routines could be written in in-line assembler
> > >
> > > If you can do this, step right up. :-)
> > >
> > > Best Regards, Simon Riggs
> >
> > Surely there's an open source code floating around somewhere?
>
> I've found some x86 MASM sources using Google, but nothing in GCC-style
> in-line assembler. It would be good to see how this compares in terms of
> speed with the code generated by GCC.

Is it BSD? I looked for some other BSD code last month. There was some, but it was clearly slower than the existing code.

It might be possible to speed things up by a factor of two using a two-byte lookup table rather than a one-byte lookup. This would take up 65536 table entries, which is only 256KB. If we held that in shared memory we could calculate the entries at startup, or read them in from a file. It would only use the same space as if we had 255 connections, so not a great increase in memory usage overall. Need to look at the effect of retrieving everything from memory rather than from cache, so it probably wouldn't be twice as fast.

Best Regards, Simon Riggs
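Simon's two-byte lookup can be sketched for a reflected CRC-32 as follows: each entry of the 65536-entry table holds the result of clocking two zero bytes through the ordinary byte-at-a-time recurrence, which by the linearity of CRC lets the main loop consume two input bytes per lookup. This is an illustration of the technique only, using the standard 0xEDB88320 polynomial rather than pg_crc.h's tables:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table1[256];        /* ordinary one-byte table */
static uint32_t crc_table2[65536];      /* two-byte table: 256 KB */

static void
crc_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;

        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table1[i] = c;
    }
    for (uint32_t i = 0; i < 65536; i++)
    {
        /* clock two zero input bytes through the one-byte recurrence */
        uint32_t c = i;

        c = crc_table1[c & 0xFF] ^ (c >> 8);
        c = crc_table1[c & 0xFF] ^ (c >> 8);
        crc_table2[i] = c;
    }
}

static uint32_t
crc32_two_bytes(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;

    while (len >= 2)                    /* two bytes per table lookup */
    {
        uint32_t w = buf[0] | ((uint32_t) buf[1] << 8);

        c = crc_table2[(c ^ w) & 0xFFFF] ^ (c >> 16);
        buf += 2;
        len -= 2;
    }
    if (len)                            /* odd trailing byte */
        c = crc_table1[(c ^ *buf) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}
```

It produces identical results to the one-byte version; whether halving the loop count beats the cache pressure of a 256 KB table is exactly the question Simon raises.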
> -----Original Message-----
> From: Simon Riggs [mailto:simon@2ndquadrant.com]
> Sent: 12 May 2005 16:52
> To: Mark Cave-Ayland (External)
> Cc: 'Christopher Kings-Lynne'; 'Tom Lane'; 'Bruce Momjian';
> pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> It might be possible to speed things up by a factor of two using a
> two-byte lookup table rather than a one byte lookup. This would take
> up 65536 table entries, which is only 256KB. If we held that in shared
> memory we could calculate the entries at startup, or read them in from
> a file. It would only use the same space as if we had 255 connections,
> so not a great increase in memory usage overall. Need to look at the
> effect of retrieving everything from memory rather than from
> cache, so it probably wouldn't be twice as fast.
>
> Best Regards, Simon Riggs

Hi everyone,

As per this thread, I decided to spend a little time over the weekend looking at the performance of the existing CRC64 code in order to determine whether we could code the algorithm in a more efficient manner. In order to do this, I devised a couple of test programs using code cut/pasted from pg_crc.h (and pg_resetxlog) in order to perform some timings. These tests were done on two systems: a P4 1.8GHz laptop with MingW (Win32) and gcc 3.2.3, and a Xeon 2.8GHz server running FC1 and gcc 3.3.2.

The program used to time the CRC64 calculation simply did the following:

1) Create an 8k buffer of data

2) Fill the buffer with some repeatable values; in this case the algorithm used was the following:

for (i = 0 ; i < 8192 ; i++ )
{
    *data = (char )(i % 256);
}

3) Calculate the CRC64 of the buffer 10,000 times and display the result, making a note of the start and end times. Using this information an estimate of the cost of a single CRC calculation can easily be found.
I noticed looking at the code that there were two versions of the CRC code, one using 32-bit arguments wrapped within a check for INT64_IS_BUSTED and another using 64-bit unsigned integers. Since I wasn't sure initially which one was being used, I decided to test both of them :) Both programs were compiled using gcc's -O2 flag for optimisation.

The first thing to notice about the results was that I obtained different CRC values using both algorithms on Win32, which were both different to the results obtained under Linux:

Win32 MingW:
1) INT64_IS_BUSTED code (32-bit): 0x541d4a27 (crc1) 0x8dfa57ae (crc0)
2) 64-bit code: 0x817f722b1f17b253 (crc0)

Linux (FC1):
1) INT64_IS_BUSTED code (32-bit): 0x21d064a2 (crc1) 0xe7c74332 (crc0)
2) 64-bit code: 0x21d064a2e7c74332 (crc0)

The second interesting thing to notice was the comparative speeds:

Win32 MingW:
1) INT64_IS_BUSTED code (32-bit): ~57us per calculation
2) 64-bit code: ~130us per calculation

Linux (FC1):
1) INT64_IS_BUSTED code (32-bit): ~29us per calculation
2) 64-bit code: ~76us per calculation

Looking at the flags in pg_config.h from both source builds, it would appear that the 64-bit code is being used since the compiler defines HAVE_LONG_LONG_INT_64. So that means there could be a potential 100% performance increase by just using the existing INT64_IS_BUSTED code on x86 32 bit processors. I don't know, given the rate of xlog records being written, whether or not this translates to more than a couple of minutes in an insert of a million records or so.

I did have a look at the assembly code produced by the INT64_IS_BUSTED code and it appears fairly tight; I'm no x86 expert but I think it would be hard to optimise what was produced by the compiler. If anyone else is interested in looking at these results then I will happily send the test programs off-list - however my main concern is that the Win32 CRC64 results just don't seem to be right :(

Kind regards,

Mark.
------------------------ WebBased Ltd 17 Research Way Plymouth PL6 8BT T: +44 (0)1752 797131 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
> -----Original Message-----
> From: Mark Cave-Ayland [mailto:m.cave-ayland@webbased.co.uk]
> Sent: 16 May 2005 09:04
> To: 'Simon Riggs'
> Cc: 'Christopher Kings-Lynne'; 'Tom Lane'; 'Bruce Momjian';
> 'pgsql-hackers@postgresql.org'
> Subject: RE: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> The program used to time the CRC64 calculation simply did the
> following:

(cut)

> 2) Fill the buffer with some repeatable values; in this case the
> algorithm used was the following:
>
> for (i = 0 ; i < 8192 ; i++ )
> {
> *data = (char )(i % 256);
> }

Sigh, so it would have helped if I had added the offset to the data pointer... ;)

for (i = 0; i < 8192; i++)
{
    *(data + i) = (char) (i % 256);
}

This now gives the following (correct) results on both platforms:

Win32: 1.8GHz P4, WinXP
Linux: 2.8GHz Xeon, FC1

Win32 UINT64: 0x782104059a01660 (crc0)                ~158us
Win32 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0)    ~58us

FC1 UINT64:   0x782104059a01660 (crc0)                ~76us
FC1 UINT32:   0x78210405 (crc1), 0x59a01660 (crc0)    ~29us

Note that we still find that the INT64_IS_BUSTED code is over 100% quicker than the UINT64 code for the CRC64 calculation, and I believe it is not being used by default under Linux or Win32 on 32-bit platforms. Of course, Tom's suggestion of going for CRC32 across the board would hopefully solve this entirely and bring the times down a little further too.

Kind regards,

Mark.
On Mon, 2005-05-16 at 12:12 +0100, Mark Cave-Ayland wrote: > This now gives the following (correct) result on both platforms: > Win32: 1.8GHz P4, WinXP > Linux: 2.8GHz Xeon, FC1 > > > Win32 UINT64: 0x782104059a01660 (crc0) > ~158us > Win32 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) ~58us > > FC1 UINT64: 0x782104059a01660 (crc0) > ~76us > FC1 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) > ~29us > > > Note that we still find that using the INT64_IS_BUSTED code is over 100% > quicker than the UINT64 code for CRC64 calculation, and I believe it is not > being used by default under Linux or Win32 for 32 bit platforms. Of course > Tom's suggestion of going for CRC32 across the board would hopefully solve > this entirely and bring the times down a little further too. I think perhaps that the difference in hardware is the reason for the difference in elapsed time, not the OS. The performance gain is disturbing. I think it's supposed to be the other way around, isn't it? Like having INT64 is supposed to be good... Perhaps the BIOS on your systems doesn't correctly support 64-bit, so mimicking it costs more. Best Regards, Simon Riggs
On E, 2005-05-16 at 12:12 +0100, Mark Cave-Ayland wrote: > > -----Original Message----- > > From: Mark Cave-Ayland [mailto:m.cave-ayland@webbased.co.uk] > > Sent: 16 May 2005 09:04 > > To: 'Simon Riggs' > > Cc: 'Christopher Kings-Lynne'; 'Tom Lane'; 'Bruce Momjian'; > > 'pgsql-hackers@postgresql.org' > > Subject: RE: [HACKERS] Cost of XLogInsert CRC calculations > > > Win32 UINT64: 0x782104059a01660 (crc0) ~158us > Win32 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) ~58us > FC1 UINT64: 0x782104059a01660 (crc0) ~76us > FC1 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) ~29us It would be interesting to see how much of the time taken is the actual algorithm and how much is getting at the data (i.e. the fact that the page has to go through the CPU at all, instead of DMA'ing it to disk). In your test setup you probably have the whole thing in CPU cache already, but still it would be interesting to time the same thing with some simple algorithm, like an XOR over the whole 8k page, for comparison. -- Hannu Krosing <hannu@skype.net>
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > Sigh, so it would help if I had added the offset to the data pointer... ;) Would you post the corrected program so people can try it on a few other architectures? No point in reinventing the wheel, even if it is a pretty trivial wheel. regards, tom lane
It occurs to me that at least on some OSes the WAL logs are being synced with O_SYNC or its ilk. In those cases the writes should be guaranteed to be written out in the order postgres wrote them. So if the tail end of the WAL entry is there (is there any sort of footer?) then the entire entry must be there. In that case is there any need to calculate the CRC at all? I suppose it's a bit of a problem in that the database doing the replay might not know which sync method was used to write the entries. The format would have to stay the same. Some magic value would have to be defined to always be correct. -- greg
Greg Stark <gsstark@mit.edu> writes: > It occurs to me that at least on some OSes the WAL logs are being synced with > O_SYNC or its ilk. In those cases the writes should be guaranteed to be > written out in the order postgres wrote them. So if the tail end of the WAL > entry is there (is there any sort of footer?) then the entire entry must be > there. In that case is there any need to calculate the CRC at all? Sure. How else do you know that the entry is all there *and is valid*? There's no "footer", and if there were it might still be garbage (eg left over from a prior cycle). Also, I doubt that O_SYNC could do anything to guarantee write order of the individual sectors within a page. regards, tom lane