Thread: PG14: Avoid checking output-buffer-length for every encoded byte during pg_hex_encode

During some development on encoding-related parts of postgres I stumbled over the new length-checking-code in
common/hex.c/pg_hex_encode.

Differently when comparing it to all versions before the output-buffer-length is checked during encoding of every
individualsource byte. 

This may impose quite a regression when using pg_dump on databases with many/big bytea or lo columns.

Because all criteria to check buffer-length are known in advance of encoding (srclen and destlen) I propose doing the
checkonly once before starting the while-loop. 

Please find the attached patch for pg14 and master, older versions did not have this behavior.

Tested on pg14-beta3, but applies also on master.

PS: This is my very first patch, I am in no way an experienced C-developer and don't have a smoothly running
developmentenvironment or experience yet. (originally mostly dealing with postgres on Windows). 

If it seems useful somebody could enter it as an open item / resolved item for pg14 after beta 3.

Thanks for looking!

Hans Buschmann


Attachment
Welcome.
Em seg., 16 de ago. de 2021 às 05:46, Hans Buschmann <buschmann@nidsa.net> escreveu:
During some development on encoding-related parts of postgres I stumbled over the new length-checking-code in common/hex.c/pg_hex_encode.

Differently when comparing it to all versions before the output-buffer-length is checked during encoding of every individual source byte.
Is always good to remove immutables from loops, if safe and secure.


This may impose quite a regression when using pg_dump on databases with many/big bytea or lo columns.
Decode has regression too.


Because all criteria to check buffer-length are known in advance of encoding (srclen and destlen) I propose doing the check only once before starting the while-loop.
Good.
 

Please find the attached patch for pg14 and master, older versions did not have this behavior.
I think that we can take one more step here.
pg_hex_decode can be improved too.
We can remove limitations from all functions that use hex functions.
byteaout from (varlena.c) has an artificial limitation to handle only MaxAllocSize length, but
nothing prevents her from being prepared for that limitation to be removed one day.


Tested on pg14-beta3, but applies also on master.

PS: This is my very first patch, I am in no way an experienced C-developer and don't have a smoothly running development environment or experience yet. (originally mostly dealing with postgres on Windows).
It seems good to me. 

Please, take a look at the new version attached.
If possible can you review the tests if there is an overflow at
pg_hex_encode and pg_hex_decode functions?

regards,
Ranier Vilela

Attachment



Von: Ranier Vilela <ranier.vf@gmail.com>
Gesendet: Montag, 16. August 2021 17:04
An: Hans Buschmann
Cc: pgsql-hackers@postgresql.org
Betreff: Re: PG14: Avoid checking output-buffer-length for every encoded byte during pg_hex_encode
 
Hello Ranier,

Thank you for your quick response.

>Is always good to remove immutables from loops, if safe and secure.

I think it's  worse because a loop changed variable is involved in the test, so it seems to be not immutable.

>Decode has regression too.

good catch, I overlooked it.

>I think that we can take one more step here.
>pg_hex_decode can be improved too.

+1

By looking a bit more precisely I detected the same checks in common/base64.c also involving loop-changed variables. Could you please make the same changes to base64.c?


>We can remove limitations from all functions that use hex functions.
>byteaout from (varlena.c) has an artificial limitation to handle only MaxAllocSize length, but
>nothing prevents her from being prepared for that limitation to be removed one day.

+1, but I don't know all implications of the type change to size_t

>Please, take a look at the new version attached.

Two remarks for decoding (by eyeball inspection of patch file only because of stilstruggling with git etc.):

1. The odd-number-of-digits-check for decoding is still part of the loop. It should be before the loop and called only once (by testing for an even number of srclen)
So we can eliminate the block if (s == srcend)​ {} inside the loop!

2. I noticed that the error messages for hex_decode should be changed as 

s/encoding/decoding/

(as in pg_log_fatal("overflow of destination buffer in hex encoding");)

>If possible can you review the tests if there is an overflow at
>pg_hex_encode and pg_hex_decode functions?

I would prefer to wait for another patch version from you (my development troubles), perhaps also dealing with base64.c

I don't know how and where any call to the encoding/decoding functions creates an overflow situation in normal operation.

I  will nonceless try the tests but I don't have any experience with it.

BTW the root cause for my work is an attempt to vastly accelerate these functions (hex and base64 encode/decode), but this is left for another day (not finished yet) and certainly only on master/new development.

Lateron support on this topic would be highly appreciated...

Best regards,
Hans Buschmann


Em seg., 16 de ago. de 2021 às 13:19, Hans Buschmann <buschmann@nidsa.net> escreveu:

Von: Ranier Vilela <ranier.vf@gmail.com>
Gesendet: Montag, 16. August 2021 17:04
An: Hans Buschmann
Cc: pgsql-hackers@postgresql.org
Betreff: Re: PG14: Avoid checking output-buffer-length for every encoded byte during pg_hex_encode
 
Hello Ranier,

Thank you for your quick response.

>Is always good to remove immutables from loops, if safe and secure.

I think it's  worse because a loop changed variable is involved in the test, so it seems to be not immutable.

>Decode has regression too.

good catch, I overlooked it.

>I think that we can take one more step here.
>pg_hex_decode can be improved too.

+1

By looking a bit more precisely I detected the same checks in common/base64.c also involving loop-changed variables. Could you please make the same changes to base64.c?
I will take a look.



>We can remove limitations from all functions that use hex functions.
>byteaout from (varlena.c) has an artificial limitation to handle only MaxAllocSize length, but
>nothing prevents her from being prepared for that limitation to be removed one day.

+1, but I don't know all implications of the type change to size_t
uint64 and size_t in 64 bits are equivalents.
uint64 and size_t in 32 bits can be weird, but anyway size_t at 32 bits can handle 1GB.
 

>Please, take a look at the new version attached.

Two remarks for decoding (by eyeball inspection of patch file only because of stilstruggling with git etc.):

1. The odd-number-of-digits-check for decoding is still part of the loop. It should be before the loop and called only once (by testing for an even number of srclen)
So we can eliminate the block if (s == srcend) {} inside the loop!
I'm afraid that is not possible.
I tested some variations for this test and regress fails at strings tests.
Anyway for test purposes, I changed it temporarily in this version, but I'm afraid we'll have to go back.


2. I noticed that the error messages for hex_decode should be changed as 

s/encoding/decoding/

(as in pg_log_fatal("overflow of destination buffer in hex encoding");)
Ok. Changed.


>If possible can you review the tests if there is an overflow at
>pg_hex_encode and pg_hex_decode functions?

I would prefer to wait for another patch version from you (my development troubles), perhaps also dealing with base64.c
base64.c can be made in another patch.


I don't know how and where any call to the encoding/decoding functions creates an overflow situation in normal operation.

I  will nonceless try the tests but I don't have any experience with it.
Thanks.


BTW the root cause for my work is an attempt to vastly accelerate these functions (hex and base64 encode/decode), but this is left for another day (not finished yet) and certainly only on master/new development.
I think this can speed up a little.


Lateron support on this topic would be highly appreciated...
If I can help, count on me.

regards,
Ranier Vilela
Attachment
On Sun, Aug 15, 2021 at 02:58:59PM +0000, Hans Buschmann wrote:
> If it seems useful somebody could enter it as an open item /
> resolved item for pg14 after beta 3.

Hmm.  Using SQLs like the following to stress pg_hex_encode(), I can
see a measurable difference easily, so no need of pg_dump or an
instance with many LOs:
set work_mem = '1GB';
with hex_values as materialized
  (select repeat(i::text, N)::bytea as i
     from generate_series(1, M) t(i))
  SELECT count(encode(i, 'hex')) from hex_values;

With N = 1000, M = 100000, perf shows a 34.88% use of pg_hex_encode().
On REL_13_STABLE, I get down ~27% for hex_encode(), the same backend
routine.

The runtime numbers are more interesting, HEAD gets up to 1600ms.
With the latest version of the patch, we get down to 1550ms.  Now, a
simple revert of ccf4e277 and aef8948 brings me down to the same
runtimes as ~13, closer to 1450ms.  There seem to be some noise in the
tests, but the difference is clear.

Considering that commit aef8948 also involved a cleanup of
src/common/hex_decode.c, I think that it would be saner to also revert
c3826f83, so as we have a clean state of the code to work on should
the work on the data encryption patch set move on in the future.  It
is not decided that this would actually use any of the hex
decode/encode code, either.

Honestly, I don't like much messing with this stuff after beta3, and
one of the motives of moving the hex handling code to src/common/ was
for the data encryption code whose state is not known as of late.
This does not prevent working on such refactorings in the future, of
course, keeping the performance impact in mind.

In short, I am planning to address this regression with the attached,
for a combined revert of 0d70d30, 5c33ba5 and 92436a7.
--
Michael

Attachment
On Mon, Aug 16, 2021 at 02:06:31PM -0300, Ranier Vilela wrote:
> uint64 and size_t in 64 bits are equivalents.
> uint64 and size_t in 32 bits can be weird, but anyway size_t at 32 bits can
> handle 1GB.

There is usually a reason why things are done as they are, so before
suggesting changing something I would advise to read the threads that
created those changes more carefully because they could be mentioned.
In this case, we are talking about aef8948, and this part of its
related thread:
https://www.postgresql.org/message-id/20210105034739.GG7432@momjian.us

> base64.c can be made in another patch.

This file is a set of code paths used by authentication and SCRAM, it
will never get as hot as the code paths that Hans has reported as
these are one-time operations.  Please note as well cfc40d3 for the
reasons why things are handled this way.  We absolutely have to use
safe routines here.
--
Michael

Attachment
Em ter., 17 de ago. de 2021 às 00:43, Michael Paquier <michael@paquier.xyz> escreveu:
On Mon, Aug 16, 2021 at 02:06:31PM -0300, Ranier Vilela wrote:
> uint64 and size_t in 64 bits are equivalents.
> uint64 and size_t in 32 bits can be weird, but anyway size_t at 32 bits can
> handle 1GB.

There is usually a reason why things are done as they are, so before
suggesting changing something I would advise to read the threads that
created those changes more carefully because they could be mentioned.
In this case, we are talking about aef8948, and this part of its
related thread:
https://www.postgresql.org/message-id/20210105034739.GG7432@momjian.us
Because things have always been done a certain way doesn't mean it's right.
Using int to address strings is an mistake.
Even with an artificial limitation to only deal with 1GB, the correct one to use would be size_t.



> base64.c can be made in another patch.

This file is a set of code paths used by authentication and SCRAM, it
will never get as hot as the code paths that Hans has reported as
these are one-time operations.  Please note as well cfc40d3 for the
reasons why things are handled this way.  We absolutely have to use
safe routines here.
I have no plans to touch base64.c

regards,
Ranier Vilela
Michael Paquier <michael@paquier.xyz> writes:
> In short, I am planning to address this regression with the attached,
> for a combined revert of 0d70d30, 5c33ba5 and 92436a7.

OK, but the commit message should explain why they're getting reverted.

            regards, tom lane



On Tue, Aug 17, 2021 at 09:39:30AM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > In short, I am planning to address this regression with the attached,
> > for a combined revert of 0d70d30, 5c33ba5 and 92436a7.
> 
> OK, but the commit message should explain why they're getting reverted.

Uh, I don't see those commits, e.g.:

    $ git diff 0d70d30
    fatal: ambiguous argument '0d70d30': unknown revision or path not in the working tree.
    Use '--' to separate paths from revisions, like this:
    'git <command> [<revision>...] -- [<file>...]'

    $ git diff 5c33ba5
    fatal: ambiguous argument '5c33ba5': unknown revision or path not in the working tree.
    Use '--' to separate paths from revisions, like this:
    'git <command> [<revision>...] -- [<file>...]'
    
    $ git diff 92436a7
    fatal: ambiguous argument '92436a7': unknown revision or path not in the working tree.
    Use '--' to separate paths from revisions, like this:
    'git <command> [<revision>...] -- [<file>...]'

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




On Tue, Aug 17, 2021 at 11:26 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Aug 17, 2021 at 09:39:30AM -0400, Tom Lane wrote:
> > Michael Paquier <michael@paquier.xyz> writes:
> > > In short, I am planning to address this regression with the attached,
> > > for a combined revert of 0d70d30, 5c33ba5 and 92436a7.
> >
> > OK, but the commit message should explain why they're getting reverted.
>
> Uh, I don't see those commits, e.g.:
>
>         $ git diff 0d70d30
>         fatal: ambiguous argument '0d70d30': unknown revision or path not in the working tree.
>         Use '--' to separate paths from revisions, like this:
>         'git <command> [<revision>...] -- [<file>...]'
>
>         $ git diff 5c33ba5
>         fatal: ambiguous argument '5c33ba5': unknown revision or path not in the working tree.
>         Use '--' to separate paths from revisions, like this:
>         'git <command> [<revision>...] -- [<file>...]'
>
>         $ git diff 92436a7
>         fatal: ambiguous argument '92436a7': unknown revision or path not in the working tree.
>         Use '--' to separate paths from revisions, like this:
>         'git <command> [<revision>...] -- [<file>...]'

Same here.  I'm assuming that the real commits are the one mentioned
in the patch, which are c3826f8,  aef8948 and ccf4e27.



On Wed, Aug 18, 2021 at 12:34:45AM +0800, Julien Rouhaud wrote:
> On Tue, Aug 17, 2021 at 11:26 PM Bruce Momjian <bruce@momjian.us> wrote:
>> Uh, I don't see those commits, e.g.:
>>
>>         $ git diff 0d70d30
>>         fatal: ambiguous argument '0d70d30': unknown revision or path not in the working tree.
>>         Use '--' to separate paths from revisions, like this:
>>         'git <command> [<revision>...] -- [<file>...]'
>>
>>         $ git diff 5c33ba5
>>         fatal: ambiguous argument '5c33ba5': unknown revision or path not in the working tree.
>>         Use '--' to separate paths from revisions, like this:
>>         'git <command> [<revision>...] -- [<file>...]'
>>
>>         $ git diff 92436a7
>>         fatal: ambiguous argument '92436a7': unknown revision or path not in the working tree.
>>         Use '--' to separate paths from revisions, like this:
>>         'git <command> [<revision>...] -- [<file>...]'
>
> Same here.  I'm assuming that the real commits are the one mentioned
> in the patch, which are c3826f8,  aef8948 and ccf4e27.

Oops, incorrect copy-paste from my side.  Yes, the commits impacted
here are aef8948, ccf4e27 and c3826f8.  The small-ish commit message
of the patch got thet right.
--
Michael

Attachment
On Tue, Aug 17, 2021 at 09:39:30AM -0400, Tom Lane wrote:
> OK, but the commit message should explain why they're getting reverted.

Sure.  aef8948 gets down because of the performance impact.  ccf4e27
was a cleanup following up aef8948, that loses its meaning.  And
c3826f8 cannot be let alone because of the reasons why aef8948 was
introduced, as it has no safety net for out-of-bound handling in the
result buffer allocated.
--
Michael

Attachment
On Wed, Aug 18, 2021 at 09:14:14AM +0900, Michael Paquier wrote:
> Sure.  aef8948 gets down because of the performance impact.  ccf4e27
> was a cleanup following up aef8948, that loses its meaning.  And
> c3826f8 cannot be let alone because of the reasons why aef8948 was
> introduced, as it has no safety net for out-of-bound handling in the
> result buffer allocated.

And done, with such a description in the commit log.
--
Michael

Attachment