Re: Looked at TODO:Considering improving performance of computing CHAR() value lengths - Mailing list pgsql-hackers

From John Cochran
Subject Re: Looked at TODO:Considering improving performance of computing CHAR() value lengths
Date
Msg-id CAGQU3n5JpRmbU7nk4NtGQ-5qWP_Y_G-qDN3_aMdXW8POR5uDhw@mail.gmail.com
Whole thread Raw
In response to Re: Looked at TODO:Considering improving performance of computing CHAR() value lengths  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers

Made a mistake in replying to commenter, not mailing list. Redoing reply.

On Mon, Aug 4, 2014 at 4:48 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 08/02/2014 09:43 PM, John Cochran wrote:
I took at look at the TODO list and got interested in the possible
... SNIP ... 
Can we see the benchmark program and the actual data, please?

 Attached as gzipped tar file.

... SNIP ...

You can't modify *arg, even momentarily, because it might point directly to an on-disk buffer. The same Datum can be read by another process at the same time, or even written out to disk
Thank you for the information. Just made some changes to the code so the values pointed to by *arg aren't modified even momentarily. 
 
You could simply check if the byte in the header happens to be != 0x20. It's very likely that it is, so that you can use the optimized version.
I already do that.
 
The biggest issue with the previously discussed patches was the lack of performance testing. People posted results of various micro-benchmarks that showed different aspects, but no-one comprehensively covered all the interesting cases. I'd like to see a simple suite of performance tests that show:

1. The worst case performance.
I can't provide that to you. Looking at my analysis, the worse case would be encountered on a big endian machine. On a big endian machine, if the data is using a 4B style header, the worse case would be for a CHAR(n) where n is 28 plus any multiple of 256 (about 4 million possibilities in total). For a little endian machine, the problem case happens for CHAR(n) where n is between 128 megabytes and 132 megabytes. I seriously doubt that the worse case would ever be encountered on a little endian machine. But it's quite possible on a big endian machine.
 
2. Performance with a couple of typical cases. CHAR(n), where n is between 1-64 would be interesting, and where the number of spaces at the end varies.
Since I've modified the code, I'm rerunning the benchmark. When benchmark is finished, will have execution times for all CHAR(n) where n ranges from 1 to 256 with trailing spaces ranging from 0 to n for each case. Type 1B headers are tested for all n from 1 to 126 along with all possible alignments from being on an even 4 byte boundary to being offset by 1, 2, and 3. For type 4B headers, they are all aligned on a 4 byte boundary and a value for all n from 1 to 256. The benchmark has currently gotten up to CHAR(40) and will take a while to finish. Even so, the data available should be more than suitable for examination.

 

You said you did benchmarking with a custom program, which is good, but it would be even better to write the benchmark as a self-contained SQL or bash script or similar. Then post it, so that others can easily repeat the same benchmarks on different platforms, and try out different versions of the patch.
- Heikki

No can do. The benchmark program is testing 4 different variants of bcTruelen() at the same time. In addition, the benchmark is keeping track of the fixed overhead involved in actually testing each function so it can be taken into consideration when examining the numbers. If I incorporated the code into a functional database, I would be restricted to testing just one variant at a time and would have no control over other activities (buffer management, disk I/O, etc) that would add noise to the benchmark data. 

Currently with the available data, I would suggest the optimized by1sentinel version of the bcTruelen(). As currently written, it attempts the following methods in sequence.

1. If *arg is pointing to a type 1B header, it simply performs the scan of trailing spaces using the 3 opcode loop. No other checks. Reason is that a type 1B head can never have the value 0x20, so a suitable sentinel is automatically known to be available.

2. The byte just prior to the 1st character is examined to see if it's a non-space. If so, the 3 opcode loop is used since a suitable sentinel has been shown to exist. Only reason this is slightly slower is because of the overhead in checking for the existence of a sentinel. On average, a big endian machine will use this option 255 times out of 256. On a little endian machine, it is extremely likely that this will be the only code path executed if the data starts with a type 4B header.

3. The entire header is checked to see if it's not equal to 0x20202020. This is almost worse case. Still able to use a 3 opcode loop, but the pointer will go 1 to 3 bytes lower than it really needs to. Easily corrected after the loop with a simple if and assignment. A big endian machine will use this code path 1 time out of 256. A little endian machine will use this code path for CHAR(134217724) through CHAR(138412027). I seriously doubt that this code path will ever be encountered on a little endian machine. On a big endian machine, it will be used only with a type 4B header for CHAR(28) which is unlikely since CHAR(28) should be using a 1B header. It will also be used for CHAR(284), CHAR(540), CHAR(796), etc. 

4. Assuming none of the 3 options above are selected, the function fall back to the standard 5 opcode loop that's currently in use. This will only happen on a big endian machine using CHAR(538976284) or a little endian machine using CHAR(134744068). Rather unlikely, but it is covered. 

Format of output from vtest  (the benchmark program built by the attached tar file)
The output is just a series of lines of the following format
hdr,offset,maxlen,spaces,overhead,by1,by4,by4sentinel,by1sentinel

hdr = 1 or 4 to indicate test done using a type 1B or 4B header.
offset = 0 to 3 to indicate memory alignment of header. 0 = aligned to 4 byte boundary. 1 through 3 = offset from 4 byte boundary by that amount.
maxlen = length of character data. Equivalent to n in CHAR(n)
spaces = number of trailing spaces. Ranges from 0 to maxlen.
overhead = number of nanoseconds consumed by 100,000 function calls and benchmark loop overhead.
by1 = number of nanoseconds consumed by 100,000 functions calls to by1 bcTruelen() function. Time includes benchmark overhead. To properly compare numbers against other versions of bcTruelen functions, subtract the overhead number given earlier. This function represents the current version of bcTruelen() being used by postgresql.
by4 = Same as by1 except using the "4 bytes at a time" version that the patch in the email conversation provided as referenced by the TODO item.
by4sentinel = Same as by1 except it performs the search 4 bytes at a time and uses a sentinel for loop termination when possible. NOTE: function not fully tested. It should work on big and little endian machines, but it is definitely not fully optimized for edge cases.
by1sentinel = Same a by1. It performs the search 1 bytes at a time and uses a sentinel for loop termination. 

Frankly, I don't see any downside to using the by1sentinel version of bcTruelen() since it is as fast, or faster than the current version of bcTruelen() for all cases. I do see lots of problems with using either of the 4 bytes at a time versions of bcTruelen() since they are slower if there are not many spaces at the end of the character string. The break even point for 4 at a time style loops seems to be about 8 to 12 trailing spaces. Below that number, they're slower. Above, faster. 

I look forward to any discussions and would be extremely interested in seeing benchmark data on a big endian architecture. 

John Cochran

--

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. — C.A.R. Hoare

Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: pg_ctl non-idempotent behavior change
Next
From: Josh Berkus
Date:
Subject: Re: Proposal to add a QNX 6.5 port to PostgreSQL