Thread: Cost of XLogInsert CRC calculations
I was profiling a case involving UPDATEs into a table with too many indexes (brought to mind by mysql's sql-bench, about which more later) and got this rather surprising result for routines costing more than 1% of the total runtime:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 64.03     86.20    86.20   133608     0.00     0.00  XLogInsert
  3.50     90.91     4.71  2484787     0.00     0.00  _bt_compare
  2.92     94.84     3.93   839893     0.00     0.00  hash_search
  2.77     98.57     3.73  1875815     0.00     0.00  LWLockAcquire
  1.89    101.12     2.55  1887972     0.00     0.00  LWLockRelease
  1.27    102.83     1.71   125234     0.00     0.00  _bt_getroot
  1.01    104.19     1.36   403342     0.00     0.00  PinBuffer
  1.00    105.54     1.35   840002     0.00     0.00  hash_any

I suppose that the bulk of the CPU cycles being attributed to XLogInsert are going into the inlined CRC calculations. Maybe we need to think twice about the cost/benefit ratio of using 64-bit CRCs to protect xlog records that are often only a few dozen bytes.

regards, tom lane
On Sun, 6 Mar 2005, Tom Lane wrote:
> I suppose that the bulk of the CPU cycles being attributed to XLogInsert
> are going into the inlined CRC calculations. Maybe we need to think
> twice about the cost/benefit ratio of using 64-bit CRCs to protect xlog
> records that are often only a few dozen bytes.

Isn't the CRC quite important on recovery to recognize where the last valid log record is?

Are there any better implementations of CRC-64? Would using a different polynomial help? Would it help to do the CRC calculation in a more wholesale fashion in XLogWrite? How about switching to CRC-32 or even CRC-16?

I searched the archives for the reason CRC-64 was chosen in the first place. It seems that the difference in computation time was not considered to be significant, and there were 8 bytes available in the record header anyway.

Just some thoughts...

- Heikki
On Sun, 2005-03-06 at 00:17 -0500, Tom Lane wrote: > I was profiling a case involving UPDATEs into a table with too many > indexes (brought to mind by mysql's sql-bench, about which more later) > and got this rather surprising result for routines costing more than > 1% of the total runtime: > > Each sample counts as 0.01 seconds. > % cumulative self self total > time seconds seconds calls s/call s/call name > 64.03 86.20 86.20 133608 0.00 0.00 XLogInsert > 3.50 90.91 4.71 2484787 0.00 0.00 _bt_compare > 2.92 94.84 3.93 839893 0.00 0.00 hash_search > 2.77 98.57 3.73 1875815 0.00 0.00 LWLockAcquire > 1.89 101.12 2.55 1887972 0.00 0.00 LWLockRelease > 1.27 102.83 1.71 125234 0.00 0.00 _bt_getroot > 1.01 104.19 1.36 403342 0.00 0.00 PinBuffer > 1.00 105.54 1.35 840002 0.00 0.00 hash_any > > I suppose that the bulk of the CPU cycles being attributed to XLogInsert > are going into the inlined CRC calculations. Maybe we need to think > twice about the cost/benefit ratio of using 64-bit CRCs to protect xlog > records that are often only a few dozen bytes. Yes, in recent performance tests sponsored by Unisys, this result was also very clear. In those tests we used Intel VTune to identify the precise lines of code soaking up the cycles...it was the CRC checks. More results should be available from the Unisys testing within a few days. I had assumed that the majority of the cost of CRC checking was as a result of the need to log complete blocks, rather than the rather small xlog records themselves? Best Regards, Simon Riggs
Hi Tom, > I was profiling a case involving UPDATEs into a table with too many indexes (brought to > mind by mysql's sql-bench, about which more later) and got this rather surprising result > for routines costing more than 1% of the total runtime: (cut) > I suppose that the bulk of the CPU cycles being attributed to XLogInsert are going into > the inlined CRC calculations. Maybe we need to think twice about the cost/benefit ratio > of using 64-bit CRCs to protect xlog records that are often only a few dozen bytes. Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm sure there are some error rates somewhere dependent upon the polynomial and the types of error detected.... Try the following link towards the bottom: http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates vs. CRC size. Kind regards, Mark. ------------------------ WebBased Ltd South West Technology Centre Tamar Science Park Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem > days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm > sure there are some error rates somewhere dependent upon the polynomial and > the types of error detected.... Try the following link towards the bottom: > http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates > vs. CRC size. When the CRC size was decided, I recall someone arguing that it would really make a difference to have 1-in-2^64 chance of failure rather than 1-in-2^32. I was dubious about this at the time, but didn't have any evidence showing that we shouldn't go for 64. I suppose we ought to try the same example with a 32-bit CRC and see how much it helps. regards, tom lane
Tom Lane wrote: > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > >>Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem >>days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm >>sure there are some error rates somewhere dependent upon the polynomial and >>the types of error detected.... Try the following link towards the bottom: >>http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates >>vs. CRC size. > > > When the CRC size was decided, I recall someone arguing that it would > really make a difference to have 1-in-2^64 chance of failure rather than > 1-in-2^32. I was dubious about this at the time, but didn't have any > evidence showing that we shouldn't go for 64. I suppose we ought to try > the same example with a 32-bit CRC and see how much it helps. Continuing this why not a 16-bit then ? Regards Gaetano Mendola
On Mon, 2005-03-07 at 09:39 -0500, Tom Lane wrote:
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem
> > days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm
> > sure there are some error rates somewhere dependent upon the polynomial and
> > the types of error detected.... Try the following link towards the bottom:
> > http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates
> > vs. CRC size.
>
> When the CRC size was decided, I recall someone arguing that it would
> really make a difference to have 1-in-2^64 chance of failure rather than
> 1-in-2^32. I was dubious about this at the time, but didn't have any
> evidence showing that we shouldn't go for 64. I suppose we ought to try
> the same example with a 32-bit CRC and see how much it helps.

I think some of the additional run-time may be coming from processor stalls associated with some of the constants used in the CRC checks. I'll come back with more info on that later.

Well, we're using the CRC in 3 separate places:
(1) for xlog records
(2) for complete blocks copied to xlog
(3) for control files

For (1), records are so short that probably CRC16 would be sufficient without increasing the error rate noticeably.

I think I'd like to keep (3) at CRC64... it's just too important. Plus that's slightly less code to change.

My money is on (2) being the source of most of that run-time anyway, since when we enclose a whole block it takes a lot longer to CRC64 all BLCKSZ bytes than it would do to CRC a single record in (1). But of course, longer stretches of data need better error detection rates.

If Ethernet is using CRC32, it seems somewhat strange to use anything higher than that, seeing as we're very likely to be sending xlog files across the net anyway. Packet size is mostly comparable to BLCKSZ, isn't it? So, yes, CRC32 seems more reasonable.
One of the things I was thinking about was whether we could use up those cycles more effectively. If we were to include a compression routine before we calculated the CRC that would - reduce the size of the blocks to be written, hence reduce size of xlog - reduce the following CRC calculation I was thinking about using a simple run-length encoding to massively shrink half-empty blocks with lots of zero padding, but we've already got code to LZW the data down also. Best Regards, Simon Riggs
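Simon's run-length idea for zero-padded blocks could look something like the sketch below. The encoding is made up purely for illustration (a 0x00 marker byte followed by a one-byte run length; nothing like this exists in the PostgreSQL tree), but it shows why half-empty pages would shrink dramatically before the CRC is taken:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zero-run RLE for half-empty pages: non-zero bytes are
 * copied through as literals; a run of zeros becomes a 0x00 marker
 * followed by a one-byte run length (at most 255 per marker).  Since
 * a literal byte is never 0x00, the stream decodes unambiguously. */
static size_t
rle_zero_encode(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < len; )
    {
        if (src[i] == 0)
        {
            size_t run = 0;

            while (i < len && src[i] == 0 && run < 255)
            {
                run++;
                i++;
            }
            dst[out++] = 0x00;                  /* zero-run marker */
            dst[out++] = (unsigned char) run;   /* run length */
        }
        else
            dst[out++] = src[i++];              /* literal byte */
    }
    return out;                                  /* compressed length */
}
```

An 8 kB block that is half zero padding would come out at roughly half its size, so the subsequent CRC cost falls proportionally.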
Simon Riggs <simon@2ndquadrant.com> writes: > Well, we're using the CRC in 3 separate places... > (1) for xlog records > (2) for complete blocks copied to xlog > (3) for control files > For (1), records are so short that probably CRC16 would be sufficient > without increasing the error rate noticeably. > I think I'd like to keep (3) at CRC64...its just too important. Plus > thats slightly less code to change. The control files are so short that CRC16 would be plenty. > My money is on (2) being the source of most of that run-time anyway, Undoubtedly, so there's not going to be much win from micro-optimization by having several different CRC functions. I would go for CRC32 across the board, myself. regards, tom lane
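For concreteness, a 32-bit table-driven CRC of the kind being proposed looks roughly like this. It uses the reflected CRC-32 polynomial 0xEDB88320 (the Ethernet/zlib one) as an illustrative sketch; it is not the actual pg_crc.h code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];

/* Precompute the CRC of every possible input byte. */
static void
crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;

        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc32_table[i] = c;
    }
}

/* One table lookup, one shift, and one XOR per input byte. */
static uint32_t
crc32(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;

    while (len-- > 0)
        c = crc32_table[(c ^ *buf++) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}
```

After crc32_init(), feeding in the string "123456789" yields the standard CRC-32 check value 0xCBF43926, which is a handy sanity test for any reimplementation.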
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Simon Riggs <simon@2ndquadrant.com> writes:
>
> > For (1), records are so short that probably CRC16 would be sufficient
> > without increasing the error rate noticeably.
>
> The control files are so short that CRC16 would be plenty.

It's not really the size of the data that governs the reasonable size of the CRC. It's the probability of there being errors. For that you have to think about what possible causes of errors you're concerned with.

AFAIK there's no CRC check at all in tables or indexes. So it doesn't seem like bad hardware like a bad drive or RAM is what you're concerned with here. From the other post it sounded like the threat was the possibility of an interruption arriving after the transaction commit log entry is written but before the fsync has written the rest of the log. In that case it seems the probability of it occurring is independent of the size of the log entry.

Is a 1 in 64k chance of a power failure resulting in data corruption acceptable? A 1 in 4 billion chance?

You could eliminate the hole completely by using two fsyncs: fsync the transaction information once, then, once that completes, write and fsync the transaction commit log entry.

-- greg
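Greg's two-fsync ordering can be sketched against plain POSIX calls as below. This is illustrative only (the function name is made up, and xlog.c's real write path is far more involved); the point is purely the ordering guarantee:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Force the transaction's data records to disk before the commit
 * record is even written, so a crash can never leave a commit record
 * on disk that points at data which never made it out.
 * Returns 0 on success, -1 on any failure. */
static int
commit_with_two_fsyncs(int fd,
                       const char *data, size_t datalen,
                       const char *commit, size_t commitlen)
{
    if (write(fd, data, datalen) != (ssize_t) datalen)
        return -1;
    if (fsync(fd) != 0)         /* first sync: data records durable */
        return -1;
    if (write(fd, commit, commitlen) != (ssize_t) commitlen)
        return -1;
    if (fsync(fd) != 0)         /* second sync: commit record durable */
        return -1;
    return 0;
}
```

The price, of course, is two synchronous disk waits per commit instead of one, which is why the CRC-over-the-whole-record approach was attractive in the first place.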
On Mon, 2005-03-07 at 20:50 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Well, we're using the CRC in 3 separate places... > > (1) for xlog records > > (2) for complete blocks copied to xlog > > (3) for control files > > > For (1), records are so short that probably CRC16 would be sufficient > > without increasing the error rate noticeably. > > > I think I'd like to keep (3) at CRC64...its just too important. Plus > > thats slightly less code to change. > > The control files are so short that CRC16 would be plenty. > > > My money is on (2) being the source of most of that run-time anyway, > > Undoubtedly, so there's not going to be much win from micro-optimization > by having several different CRC functions. Agreed. > I would go for CRC32 across > the board, myself. Sold. Best Regards, Simon Riggs
On Mon, Mar 07, 2005 at 11:53:59PM +0000, Simon Riggs wrote:
> On Mon, 2005-03-07 at 09:39 -0500, Tom Lane wrote:
> > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
[skipped]
> Well, we're using the CRC in 3 separate places...
> (1) for xlog records
> (2) for complete blocks copied to xlog
> (3) for control files
>
> For (1), records are so short that probably CRC16 would be sufficient
> without increasing the error rate noticeably.
>
> I think I'd like to keep (3) at CRC64...its just too important. Plus
> thats slightly less code to change.
>
> My money is on (2) being the source of most of that run-time anyway,
> since when we enclose a whole block it takes a lot longer to CRC64 all
> BLCKSZ bytes than it would do to CRC a single record in (1). But of
> course, longer stretches of data need better error detection rates.

Well, if there is no need for error recovery, then what about using a simpler algorithm, like a plain checksum? Perhaps it could even be folded into one of the required memcpy calls.
> One of the things I was thinking about was whether we could use up those > cycles more effectively. If we were to include a compression routine > before we calculated the CRC that would > - reduce the size of the blocks to be written, hence reduce size of xlog > - reduce the following CRC calculation > > I was thinking about using a simple run-length encoding to massively > shrink half-empty blocks with lots of zero padding, but we've already > got code to LZW the data down also. > > Best Regards, Simon Riggs > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org Simon, I think having a compression routine in there could make real sense. We have done some major I/O testing involving compression for a large customer some time ago. We have seen that compressing / decompressing on the fly is in MOST cases much faster than uncompressed I/O (try a simple "cat file | ..." vs." zcat file.gz | ...") - the zcat version will be faster on all platforms we have tried (Linux, AIX, Sun on some SAN system, etc. ...). Also, when building up a large database within one transaction the xlog will eat a lot of storage - this can be quite annoying when you have to deal with a lot of data). Are there any technical reasons which would prevent somebody from implementing compression? Best regards, Hans -- Cybertec Geschwinde u Schoenig Schoengrabern 134, A-2020 Hollabrunn, Austria Tel: +43/660/816 40 77 www.cybertec.at, www.postgresql.at
On Fri, 2005-03-11 at 19:31 +0100, Hans-Jürgen Schönig wrote: > > One of the things I was thinking about was whether we could use up those > > cycles more effectively. If we were to include a compression routine > > before we calculated the CRC that would > > - reduce the size of the blocks to be written, hence reduce size of xlog > > - reduce the following CRC calculation > > > > I was thinking about using a simple run-length encoding to massively > > shrink half-empty blocks with lots of zero padding, but we've already > > got code to LZW the data down also. > I think having a compression routine in there could make real sense. > We have done some major I/O testing involving compression for a large > customer some time ago. We have seen that compressing / decompressing on > the fly is in MOST cases much faster than uncompressed I/O (try a simple > "cat file | ..." vs." zcat file.gz | ...") - the zcat version will be > faster on all platforms we have tried (Linux, AIX, Sun on some SAN > system, etc. ...). > Also, when building up a large database within one transaction the xlog > will eat a lot of storage - this can be quite annoying when you have to > deal with a lot of data). Agreed. > Are there any technical reasons which would prevent somebody from > implementing compression? Not currently, that I'm aware of. But the way additional blocks are added to xlog records would need to be changed to allow for variable length output. It's on my list... Best Regards, Simon Riggs
> -----Original Message----- > From: Mark Cave-Ayland [mailto:m.cave-ayland@webbased.co.uk] > Sent: 07 March 2005 11:04 > To: 'tgl@sss.pgh.pa.us' > Cc: 'pgsql-hackers@postgreSQL.org' > Subject: Re: Cost of XLogInsert CRC calculations (cut) > > I suppose that the bulk of the CPU cycles being attributed to > > XLogInsert are going into > the inlined CRC calculations. Maybe we > > need to think twice about the cost/benefit ratio > of using 64-bit > > CRCs to protect xlog records that are often only a few dozen bytes. > > Wow, a 64-bit CRC does seem excessive, especially when going > back to Zmodem days where a 50-100k file seemed to be easily > protected by a 32-bit CRC. I'm sure there are some error > rates somewhere dependent upon the polynomial and the types > of error detected.... Try the following link towards the > bottom: http://www.ee.unb.ca/tervo/ee4253/crc.htm for some > theory on detection rates vs. CRC size. Hi Tom/Simon, I was just researching some articles on compression (zlib) and I saw mention of the Adler-32 algorithm which is supposed to be slightly less accurate than an equivalent CRC calculation but significantly faster to compute. I haven't located a good paper comparing the error rates of the two different checksums, however extending the Adler-32 algorithm up to 64 bits may be a way of increasing the speed at the expense of a slight loss in the accuracy of error detection if we decided that we want to stay at 64 bits as opposed to 32. The following seems to indicate that Adler-32 is at least twice as fast as optimised CRC32: http://www.winimage.com/misc/readfile_test.htm. Kind regards, Mark. ------------------------ WebBased Ltd 17 Research Way Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
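For reference, the Adler-32 that Mark mentions is simple enough to sketch from its definition in RFC 1950: two running sums reduced modulo 65521 (the largest prime below 2^16), packed into one 32-bit word. This naive version reduces on every byte; real implementations such as zlib's defer the expensive modulo across long runs, which is where the speed advantage comes from:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u        /* largest prime below 2^16 */

static uint32_t
adler32(const unsigned char *buf, size_t len)
{
    uint32_t a = 1;             /* running sum of bytes, plus one */
    uint32_t b = 0;             /* running sum of the values of a */

    for (size_t i = 0; i < len; i++)
    {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;       /* high half: b, low half: a */
}
```

Note the structural weakness for short inputs: with only a few dozen bytes, a barely moves away from its starting value and b covers few of its 16 bits, so much of the checksum space goes unused on xlog-record-sized data.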
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > I was just researching some articles on compression (zlib) and I saw mention > of the Adler-32 algorithm which is supposed to be slightly less accurate > than an equivalent CRC calculation but significantly faster to compute. I > haven't located a good paper comparing the error rates of the two different > checksums, ... probably because there isn't one. With all due respect to the Zip guys, I doubt anyone has done anywhere near the analysis on Adler-32 that has been done on CRCs. I'd much prefer to stick with true CRC and drop it to 32 bits than go with a less-tested algorithm. Throwing more bits at the problem doesn't necessarily create a safer checksum. regards, tom lane
Tom Lane wrote: > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > > I was just researching some articles on compression (zlib) and I saw mention > > of the Adler-32 algorithm which is supposed to be slightly less accurate > > than an equivalent CRC calculation but significantly faster to compute. I > > haven't located a good paper comparing the error rates of the two different > > checksums, > > ... probably because there isn't one. With all due respect to the Zip > guys, I doubt anyone has done anywhere near the analysis on Adler-32 > that has been done on CRCs. I'd much prefer to stick with true CRC > and drop it to 32 bits than go with a less-tested algorithm. Throwing > more bits at the problem doesn't necessarily create a safer checksum. Agreed. 64-bit was overkill when we added it, and it is now shown to be a performance problem. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Tue, 2005-05-10 at 10:34 -0400, Bruce Momjian wrote: > Tom Lane wrote: > > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > > > I was just researching some articles on compression (zlib) and I saw mention > > > of the Adler-32 algorithm which is supposed to be slightly less accurate > > > than an equivalent CRC calculation but significantly faster to compute. I > > > haven't located a good paper comparing the error rates of the two different > > > checksums, > > > > ... probably because there isn't one. With all due respect to the Zip > > guys, I doubt anyone has done anywhere near the analysis on Adler-32 > > that has been done on CRCs. I'd much prefer to stick with true CRC > > and drop it to 32 bits than go with a less-tested algorithm. Throwing > > more bits at the problem doesn't necessarily create a safer checksum. > > Agreed. 64-bit was overkill when we added it, and it is now shown to be > a performance problem. Hold on... Tom has shown that there is a performance problem with the existing CRC calculation. Thanks to Mark for staying on top of that with some good ideas. The cause of the performance problem has been attributed to it being a 64-bit rather than 32-bit calculation. That is certainly part of it, but I have seen evidence that there is an Intel processor stall associated with the use of a single byte constant somewhere in the algorithm. So I'm unclear as to what extent the poor performance is attributable to either issue. That's where my experience stops so I have highlighted that for somebody with more hardware specific assembler experience to have a look at the algorithm. Fixing that, if possible, could greatly improve the performance whether or not we drop from 64 to 32 bits. My hope for outside assistance on that looks like it is not now forthcoming. My guess would be that the algorithm's use of the byte-by-byte calculation together with a bitmask of &FF is responsible. 
Perhaps varying the length of the bitmask to either &000000FF or longer may help? (see src/include/xlog_internal.h) Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > The cause of the performance problem has been attributed to it being a > 64-bit rather than 32-bit calculation. That is certainly part of it, but > I have seen evidence that there is an Intel processor stall associated > with the use of a single byte constant somewhere in the algorithm. That's awfully vague --- can't you give any more detail? I have seen XLogInsert eating significant amounts of time (up to 10% of total CPU time) on non-Intel architectures, so I think that dropping down to 32 bits is warranted in any case. But if you are correct then that might not fix the problem on Intel machines. We need more info. regards, tom lane
On Tue, 2005-05-10 at 18:22 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The cause of the performance problem has been attributed to it being a
> > 64-bit rather than 32-bit calculation. That is certainly part of it, but
> > I have seen evidence that there is an Intel processor stall associated
> > with the use of a single byte constant somewhere in the algorithm.
>
> That's awfully vague --- can't you give any more detail?
>
> I have seen XLogInsert eating significant amounts of time (up to 10% of
> total CPU time) on non-Intel architectures, so I think that dropping
> down to 32 bits is warranted in any case. But if you are correct then
> that might not fix the problem on Intel machines. We need more info.

I have seen an Intel VTune report that shows a memory stall causing high latency associated with a single assembly instruction in the compiled code of the CRC calculation. The instruction was manipulating a single byte only. I couldn't tell exactly which line of PostgreSQL code produced the assembler. This could be either a partial register stall or a memory order buffer stall (or another?). Here's a discussion of this: http://www.gamasutra.com/features/19991221/barad_pfv.htm

Sorry, but that's all I know. I will try to obtain the report, which is not in my possession.

I do *not* know with any certainty what the proportion of time lost from the CRC calc proper in an idealised CPU is against the time lost from this hardware-specific interaction. I don't know if non-Intel CPUs are affected either.

Best Regards, Simon Riggs
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: 10 May 2005 23:22 > To: Simon Riggs > Cc: Bruce Momjian; Mark Cave-Ayland (External); > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations (cut) > That's awfully vague --- can't you give any more detail? > > I have seen XLogInsert eating significant amounts of time (up > to 10% of total CPU time) on non-Intel architectures, so I > think that dropping down to 32 bits is warranted in any case. > But if you are correct then that might not fix the problem > on Intel machines. We need more info. > > regards, tom lane Hi Tom/Simon, Just for the record, I found a better analysis of Adler-32 following some links from Wikipedia. In summary, the problem with Adler-32 is that while it is only slightly less sensitive than CRC-32, it requires roughly a 1k "run-in" in order to attain full coverage of the bits (with respect to sensitivity of the input). This compares to 4 bytes of "run-in" required for CRC-32. So unless we can guarantee a minimum of 1k data per Xlog record then Adler-32 won't be suitable. See the following two links for more information: http://en.wikipedia.org/wiki/Adler-32 http://www.ietf.org/rfc/rfc3309.txt One other consideration would be that since CRC-32 calculations for Xlog records occur so often, perhaps the CRC-32 routines could be written in in-line assembler, falling back to C for unsupported processors. It would be interesting to come up with some benchmarks to see if indeed this would be faster than the current C implementation, since as the routine is called so often it could add up to a significant saving under higher loads. Kind regards, Mark. ------------------------ WebBased Ltd 17 Research Way Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
On Wed, 2005-05-11 at 13:40 +0100, Mark Cave-Ayland wrote: > So unless we can guarantee a minimum of 1k data per Xlog record then > Adler-32 won't be suitable. Most records are either 8KB or much less than 1KB. Is the benefit gained from the 8KB records worth the loss on the more frequent smaller records? > perhaps the CRC-32 routines could be written in in-line assembler If you can do this, step right up. :-) Best Regards, Simon Riggs
>>perhaps the CRC-32 routines could be written in in-line assembler > > > If you can do this, step right up. :-) > > Best Regards, Simon Riggs Surely there's an open source code floating around somewhere? Chris
> -----Original Message----- > From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au] > Sent: 12 May 2005 02:46 > To: Simon Riggs > Cc: Mark Cave-Ayland (External); 'Tom Lane'; 'Bruce Momjian'; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations > > > >>perhaps the CRC-32 routines could be written in in-line assembler > > > > > > If you can do this, step right up. :-) > > > > Best Regards, Simon Riggs > > Surely there's an open source code floating around somewhere? > > Chris Hi Chris, I've found some x86 MASM sources using Google, but nothing in GCC-style in-line assembler. It would be good to see how this compares in terms of speed with the code generated by GCC. Kind regards, Mark. ------------------------ WebBased Ltd 17 Research Way Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
On Thu, 2005-05-12 at 13:48 +0100, Mark Cave-Ayland wrote:
> > From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
> >
> > >>perhaps the CRC-32 routines could be written in in-line assembler
> > >
> > > If you can do this, step right up. :-)
> > >
> > > Best Regards, Simon Riggs
> >
> > Surely there's an open source code floating around somewhere?
>
> I've found some x86 MASM sources using Google, but nothing in GCC-style
> in-line assembler. It would be good to see how this compares in terms of
> speed with the code generated by GCC.

Is it BSD? I looked for some other BSD code last month. There was some, but it was clearly slower than the existing code.

It might be possible to speed things up by a factor of two using a two-byte lookup table rather than a one-byte lookup. This would take up 65536 table entries, which is only 256KB. If we held that in shared memory we could calculate the entries at startup, or read them in from a file. It would only use the same space as if we had 255 connections, so not a great increase in memory usage overall. Need to look at the effect of retrieving everything from memory rather than from cache, so it probably wouldn't be twice as fast.

Best Regards, Simon Riggs
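Simon's two-byte lookup can be sketched for a reflected CRC-32 as follows: each entry of the 65536-entry table holds the result of clocking two zero bytes through the ordinary byte-at-a-time recurrence, which by the linearity of CRC lets the main loop consume two input bytes per lookup. This is an illustration of the technique only, using the standard 0xEDB88320 polynomial rather than pg_crc.h's tables:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table1[256];        /* ordinary one-byte table */
static uint32_t crc_table2[65536];      /* two-byte table: 256 KB */

static void
crc_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;

        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table1[i] = c;
    }
    for (uint32_t i = 0; i < 65536; i++)
    {
        /* clock two zero input bytes through the one-byte recurrence */
        uint32_t c = i;

        c = crc_table1[c & 0xFF] ^ (c >> 8);
        c = crc_table1[c & 0xFF] ^ (c >> 8);
        crc_table2[i] = c;
    }
}

static uint32_t
crc32_two_bytes(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;

    while (len >= 2)                    /* two bytes per table lookup */
    {
        uint32_t w = buf[0] | ((uint32_t) buf[1] << 8);

        c = crc_table2[(c ^ w) & 0xFFFF] ^ (c >> 16);
        buf += 2;
        len -= 2;
    }
    if (len)                            /* odd trailing byte */
        c = crc_table1[(c ^ *buf) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}
```

It produces identical results to the one-byte version; whether halving the loop count beats the cache pressure of a 256 KB table is exactly the question Simon raises.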
> -----Original Message-----
> From: Simon Riggs [mailto:simon@2ndquadrant.com]
> Sent: 12 May 2005 16:52
> To: Mark Cave-Ayland (External)
> Cc: 'Christopher Kings-Lynne'; 'Tom Lane'; 'Bruce Momjian';
> pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> It might be possible to speed things up by a factor of two using a
> two-byte lookup table rather than a one byte lookup. This would take
> up 65536 table entries, which is only 256KB. If we held that in shared
> memory we could calculate the entries at startup, or read them in from
> a file. It would only use the same space as if we had 255 connections,
> so not a great increase in memory usage overall. Need to look at the
> effect of retrieving everything from memory rather than from
> cache, so it probably wouldn't be twice as fast.
>
> Best Regards, Simon Riggs

Hi everyone,

As per this thread, I decided to spend a little time over the weekend looking at the performance of the existing CRC64 code in order to determine whether we could code the algorithm in a more efficient manner. In order to do this, I devised a couple of test programs using code cut/pasted from pg_crc.h (and pg_resetxlog) in order to perform some timings. These tests were done on two systems: a P4 1.8GHz laptop with MingW (Win32) and gcc 3.2.3, and a Xeon 2.8GHz server running FC1 and gcc 3.3.2.

The program used to time the CRC64 calculation simply did the following:

1) Create an 8k buffer of data

2) Fill the buffer with some repeatable values; in this case the algorithm used was the following:

for (i = 0 ; i < 8192 ; i++ )
{
    *data = (char )(i % 256);
}

3) Calculate the CRC64 of the buffer 10,000 times and display the result, making a note of the start and end times. Using this information an estimate of the cost of a single CRC calculation can easily be found.
I noticed looking at the code that there were two versions of the CRC code, one using 32-bit arguments wrapped within a check for INT64_IS_BUSTED and another using 64-bit unsigned integers. Since I wasn't sure initially which one was being used, I decided to test both of them :) Both programs were compiled using gcc's -O2 flag for optimisation.

The first thing to notice about the results was that I obtained different CRC values using both algorithms on Win32, which were both different to the results obtained under Linux:

Win32 MingW:
1) INT64_IS_BUSTED code (32-bit): 0x541d4a27 (crc1) 0x8dfa57ae (crc0)
2) 64-bit code: 0x817f722b1f17b253 (crc0)

Linux (FC1):
1) INT64_IS_BUSTED code (32-bit): 0x21d064a2 (crc1) 0xe7c74332 (crc0)
2) 64-bit code: 0x21d064a2e7c74332 (crc0)

The second interesting thing to notice was the comparative speeds:

Win32 MingW:
1) INT64_IS_BUSTED code (32-bit): ~57us per calculation
2) 64-bit code: ~130us per calculation

Linux (FC1):
1) INT64_IS_BUSTED code (32-bit): ~29us per calculation
2) 64-bit code: ~76us per calculation

Looking at the flags in pg_config.h from both source builds, it would appear that the 64-bit code is being used since the compiler defines HAVE_LONG_LONG_INT_64. So that means there could be a potential 100% performance increase by just using the existing INT64_IS_BUSTED code on x86 32 bit processors. I don't know, given the rate of xlog records being written, whether or not this translates to more than a couple of minutes in an insert of a million records or so.

I did have a look at the assembly code produced by the INT64_IS_BUSTED code and it appears fairly tight; I'm no x86 expert but I think it would be hard to optimise what was produced by the compiler. If anyone else is interested in looking at these results then I will happily send the test programs off-list - however my main concern is that the Win32 CRC64 results just don't seem to be right :(

Kind regards,

Mark.
------------------------ WebBased Ltd 17 Research Way Plymouth PL6 8BT T: +44 (0)1752 797131 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
> -----Original Message-----
> From: Mark Cave-Ayland [mailto:m.cave-ayland@webbased.co.uk]
> Sent: 16 May 2005 09:04
> To: 'Simon Riggs'
> Cc: 'Christopher Kings-Lynne'; 'Tom Lane'; 'Bruce Momjian';
> 'pgsql-hackers@postgresql.org'
> Subject: RE: [HACKERS] Cost of XLogInsert CRC calculations

(cut)

> The program used to time the CRC64 calculation simply did the
> following:

(cut)

> 2) Fill the buffer with some repeatable values; in this case the
> algorithm used was the following:
>
> for (i = 0 ; i < 8192 ; i++ )
> {
> *data = (char )(i % 256);
> }

Sigh, so it would have helped if I had added the offset to the data pointer... ;)

for (i = 0; i < 8192; i++)
{
    *(data + i) = (char) (i % 256);
}

This now gives the following (correct) results on both platforms:

Win32: 1.8GHz P4, WinXP
Linux: 2.8GHz Xeon, FC1

Win32 UINT64: 0x782104059a01660 (crc0)                ~158us
Win32 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0)    ~58us

FC1 UINT64:   0x782104059a01660 (crc0)                ~76us
FC1 UINT32:   0x78210405 (crc1), 0x59a01660 (crc0)    ~29us

Note that we still find that the INT64_IS_BUSTED code is over 100% quicker than the UINT64 code for the CRC64 calculation, and I believe it is not being used by default under Linux or Win32 on 32-bit platforms. Of course, Tom's suggestion of going for CRC32 across the board would hopefully solve this entirely and bring the times down a little further too.

Kind regards,

Mark.
On Mon, 2005-05-16 at 12:12 +0100, Mark Cave-Ayland wrote: > This now gives the following (correct) result on both platforms: > Win32: 1.8GHz P4, WinXP > Linux: 2.8GHz Xeon, FC1 > > > Win32 UINT64: 0x782104059a01660 (crc0) > ~158us > Win32 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) ~58us > > FC1 UINT64: 0x782104059a01660 (crc0) > ~76us > FC1 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) > ~29us > > > Note that we still find that using the INT64_IS_BUSTED code is over 100% > quicker than the UINT64 code for CRC64 calculation, and I believe it is not > being used by default under Linux or Win32 for 32 bit platforms. Of course > Tom's suggestion of going for CRC32 across the board would hopefully solve > this entirely and bring the times down a little further too. I think perhaps that the difference in hardware is the reason for the difference in elapsed time, not the OS. The performance gain is disturbing. I think it's supposed to be the other way around, isn't it? Like having INT64 is supposed to be good... Perhaps the BIOS on your systems doesn't correctly support 64-bit, so mimicking it costs more. Best Regards, Simon Riggs
On E, 2005-05-16 at 12:12 +0100, Mark Cave-Ayland wrote: > > -----Original Message----- > > From: Mark Cave-Ayland [mailto:m.cave-ayland@webbased.co.uk] > > Sent: 16 May 2005 09:04 > > To: 'Simon Riggs' > > Cc: 'Christopher Kings-Lynne'; 'Tom Lane'; 'Bruce Momjian'; > > 'pgsql-hackers@postgresql.org' > > Subject: RE: [HACKERS] Cost of XLogInsert CRC calculations > > > Win32 UINT64: 0x782104059a01660 (crc0) ~158us > Win32 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) ~58us > FC1 UINT64: 0x782104059a01660 (crc0) ~76us > FC1 UINT32: 0x78210405 (crc1), 0x59a01660 (crc0) ~29us It would be interesting to see how much of the time taken is the actual algorithm and how much is getting at the data (i.e. the fact that the page has to go through the CPU at all, instead of DMA'ing it to disk). In your test setup you probably have the whole thing in CPU cache already, but still it would be interesting to time the same thing with some simple algorithm, like an XOR over the whole 8k page, for comparison. -- Hannu Krosing <hannu@skype.net>
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > Sigh, so it would help if I had added the offset to the data pointer... ;) Would you post the corrected program so people can try it on a few other architectures? No point in reinventing the wheel, even if it is a pretty trivial wheel. regards, tom lane
It occurs to me that at least on some OSes the WAL logs are being synced with O_SYNC or its ilk. In those cases the writes should be guaranteed to be written out in the order postgres wrote them. So if the tail end of the WAL entry is there (is there any sort of footer?) then the entire entry must be there. In that case is there any need to calculate the CRC at all? I suppose it's a bit of a problem in that the database doing the replay might not know which sync method was used to write the entries. The format would have to stay the same. Some magic value would have to be defined to always be correct. -- greg
Greg Stark <gsstark@mit.edu> writes: > It occurs to me that at least on some OSes the WAL logs are being synced with > O_SYNC or its ilk. In those cases the writes should be guaranteed to be > written out in the order postgres wrote them. So if the tail end of the WAL > entry is there (is there any sort of footer?) then the entire entry must be > there. In that case is there any need to calculate the CRC at all? Sure. How else do you know that the entry is all there *and is valid*? There's no "footer", and if there were it might still be garbage (eg left over from a prior cycle). Also, I doubt that O_SYNC could do anything to guarantee write order of the individual sectors within a page. regards, tom lane