Thread: pgbench test failing on 14beta1 on Debian/i386

pgbench test failing on 14beta1 on Debian/i386

From
Christoph Berg
Date:
Hi,

I can reproducibly get build failures in pgbench on 32-bit i386
Debian, both on sid and buster. (The older Debian stretch and Ubuntu
bionic are unaffected. Other architectures are also fine.)

https://pgdgbuild.dus.dg-i.net/view/Binaries/job/postgresql-14-binaries/635/


https://pgdgbuild.dus.dg-i.net/view/Binaries/job/postgresql-14-binaries/635/architecture=i386,distribution=sid/consoleFull

17:39:41 make[2]: Entering directory '/<<PKGBUILDDIR>>/build/src/bin/pgbench'
17:39:41 rm -rf '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check
17:39:41 /bin/mkdir -p '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check
17:39:41 cd /<<PKGBUILDDIR>>/build/../src/bin/pgbench && TESTDIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench'
PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/postgresql/14/bin:$PATH"
LD_LIBRARY_PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/i386-linux-gnu" PGPORT='65432'
PG_REGRESS='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../../src/test/regress/pg_regress'
REGRESS_SHLIB='/<<PKGBUILDDIR>>/build/src/test/regress/regress.so'/usr/bin/prove -I
/<<PKGBUILDDIR>>/build/../src/test/perl/-I /<<PKGBUILDDIR>>/build/../src/bin/pgbench --verbose t/*.pl
 
17:39:50 
17:39:50 #   Failed test 'pgbench expressions stderr /(?^:command=113.: boolean true\b)/'
17:39:50 #   at t/001_pgbench_with_server.pl line 421.
17:39:50 #                   'pgbench: setting random seed to 5432
17:39:50 # starting vacuum...end.
17:39:50 # debug(script=0,command=1): int 13
17:39:50 # debug(script=0,command=2): int 116
17:39:50 # debug(script=0,command=3): int 1498
17:39:50 # debug(script=0,command=4): int 4
17:39:50 # debug(script=0,command=5): int 5
17:39:50 # debug(script=0,command=6): int 6
17:39:50 # debug(script=0,command=7): int 7
17:39:50 # debug(script=0,command=8): int 8
17:39:50 # debug(script=0,command=9): int 9
17:39:50 # debug(script=0,command=10): int 10
17:39:50 # debug(script=0,command=11): int 11
17:39:50 # debug(script=0,command=12): int 12
17:39:50 # debug(script=0,command=13): double 13.856406460551
17:39:50 # debug(script=0,command=14): double 14.8514851485149
17:39:50 # debug(script=0,command=15): double 15.39380400259
17:39:50 # debug(script=0,command=16): double 16
17:39:50 # debug(script=0,command=17): double 17.094
17:39:50 # debug(script=0,command=20): int 1
17:39:50 # debug(script=0,command=21): double -27
17:39:50 # debug(script=0,command=22): double 1024
17:39:50 # debug(script=0,command=23): double 1
17:39:50 # debug(script=0,command=24): double 1
17:39:50 # debug(script=0,command=25): double -0.125
17:39:50 # debug(script=0,command=26): double -0.125
17:39:50 # debug(script=0,command=27): double -0.00032
17:39:50 # debug(script=0,command=28): double 8.50705917302346e+37
17:39:50 # debug(script=0,command=29): double 1e+30
17:39:50 # debug(script=0,command=30): boolean false
17:39:50 # debug(script=0,command=31): boolean true
17:39:50 # debug(script=0,command=32): int 32
17:39:50 # debug(script=0,command=33): int 33
17:39:50 # debug(script=0,command=34): double 34
17:39:50 # debug(script=0,command=35): int 35
17:39:50 # debug(script=0,command=36): int 36
17:39:50 # debug(script=0,command=37): double 37.0000002
17:39:50 # debug(script=0,command=38): int 38
17:39:50 # debug(script=0,command=39): int 39
17:39:50 # debug(script=0,command=40): boolean true
17:39:50 # debug(script=0,command=41): null
17:39:50 # debug(script=0,command=42): null
17:39:50 # debug(script=0,command=43): boolean true
17:39:50 # debug(script=0,command=44): boolean true
17:39:50 # debug(script=0,command=45): boolean true
17:39:50 # debug(script=0,command=46): int 46
17:39:50 # debug(script=0,command=47): boolean true
17:39:50 # debug(script=0,command=48): boolean true
17:39:50 # debug(script=0,command=49): int -5817877081768721676
17:39:50 # debug(script=0,command=50): boolean true
17:39:50 # debug(script=0,command=51): int -7793829335365542153
17:39:50 # debug(script=0,command=52): int -1464711246773187029
17:39:50 # debug(script=0,command=53): boolean true
17:39:50 # debug(script=0,command=55): int -1
17:39:50 # debug(script=0,command=56): int -1
17:39:50 # debug(script=0,command=57): int 1
17:39:50 # debug(script=0,command=65): int 65
17:39:50 # debug(script=0,command=74): int 74
17:39:50 # debug(script=0,command=83): int 83
17:39:50 # debug(script=0,command=86): int 86
17:39:50 # debug(script=0,command=93): int 93
17:39:50 # debug(script=0,command=95): int 0
17:39:50 # debug(script=0,command=96): int 1
17:39:50 # debug(script=0,command=97): int 0
17:39:50 # debug(script=0,command=98): int 5432
17:39:50 # debug(script=0,command=99): int -9223372036854775808
17:39:50 # debug(script=0,command=100): int 9223372036854775807
17:39:50 # debug(script=0,command=101): boolean true
17:39:50 # debug(script=0,command=102): boolean true
17:39:50 # debug(script=0,command=103): boolean true
17:39:50 # debug(script=0,command=104): boolean true
17:39:50 # debug(script=0,command=105): boolean true
17:39:50 # debug(script=0,command=109): boolean true
17:39:50 # debug(script=0,command=110): boolean true
17:39:50 # debug(script=0,command=111): boolean true
17:39:50 # debug(script=0,command=112): int 9223372036854775797
17:39:50 # debug(script=0,command=113): boolean false
17:39:50 # '
17:39:50 #     doesn't match '(?^:command=113.: boolean true\b)'
17:39:52 # Looks like you failed 1 test of 415.
17:39:52 t/001_pgbench_with_server.pl .. 
17:39:52 ok 1 - concurrent OID generation status (got 0 vs expected 0)
17:39:52 ok 2 - concurrent OID generation stdout /(?^:processed: 125/125)/
17:39:52 ok 3 - concurrent OID generation stderr /(?^:^$)/
17:39:52 ok 4 - no such database status (got 1 vs expected 1)
17:39:52 ok 5 - no such database stdout /(?^:^$)/
17:39:52 ok 6 - no such database stderr /(?^:connection to server .* failed)/
17:39:52 ok 7 - no such database stderr /(?^:FATAL:  database "no-such-database" does not exist)/
17:39:52 ok 8 - run without init status (got 1 vs expected 1)
17:39:52 ok 9 - run without init stdout /(?^:^$)/
17:39:52 ok 10 - run without init stderr /(?^:Perhaps you need to do initialization)/
17:39:52 ok 11 - pgbench scale 1 initialization status (got 0 vs expected 0)
17:39:52 ok 12 - pgbench scale 1 initialization stdout /(?^:^$)/

[...]

17:39:52 ok 172 - pgbench expressions stderr /(?^:command=110.: boolean true\b)/
17:39:52 ok 173 - pgbench expressions stderr /(?^:command=111.: boolean true\b)/
17:39:52 ok 174 - pgbench expressions stderr /(?^:command=112.: int 9223372036854775797\b)/
17:39:52 not ok 175 - pgbench expressions stderr /(?^:command=113.: boolean true\b)/
17:39:52 ok 176 - random seeded with 733446049 status (got 0 vs expected 0)
17:39:52 ok 177 - random seeded with 733446049 stdout /(?^:processed: 1/1)/
17:39:52 ok 178 - random seeded with 733446049 stderr /(?^:setting random seed to 733446049\b)/

[...]

17:39:52 ok 415 - remove log files
17:39:52 1..415
17:39:52 Dubious, test returned 1 (wstat 256, 0x100)
17:39:52 Failed 1/415 subtests
17:39:53 t/002_pgbench_no_server.pl ....

Christoph



Re: pgbench test failing on 14beta1 on Debian/i386

From
Tom Lane
Date:
Christoph Berg <myon@debian.org> writes:
> I can reproducibly get build failures in pgbench on 32-bit i386
> Debian, both on sid and buster. (The older Debian stretch and Ubuntu
> bionic are unaffected. Other architectures are also fine.)

The test that's failing came in with

6b258e3d688db14aadb58dde2a72939362310684
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Date:   Tue Apr 6 11:50:42 2021 +0100

    pgbench: Function to generate random permutations.


            regards, tom lane



Re: pgbench test failing on 14beta1 on Debian/i386

From
Thomas Munro
Date:
On Wed, May 19, 2021 at 9:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Christoph Berg <myon@debian.org> writes:
> > I can reproducibly get build failures in pgbench on 32-bit i386
> > Debian, both on sid and buster. (The older Debian stretch and Ubuntu
> > bionic are unaffected. Other architectures are also fine.)
>
> The test that's failing came in with
>
> 6b258e3d688db14aadb58dde2a72939362310684
> Author: Dean Rasheed <dean.a.rasheed@gmail.com>
> Date:   Tue Apr 6 11:50:42 2021 +0100
>
>     pgbench: Function to generate random permutations.

FWIW this is reproducible on my local Debian/gcc box with -m32, but
not on my FreeBSD/clang box with -m32.  permute() produces different
values here:

\set t debug(permute(:size-1, :size, 5432) = 5301702756001087507 and \
             permute(:size-2, :size, 5432) = 8968485976055840695 and \
             permute(:size-3, :size, 5432) = 6708495591295582115 and \
             permute(:size-4, :size, 5432) = 2801794404574855121 and \
             permute(:size-5, :size, 5432) = 1489011409218895840 and \
             permute(:size-6, :size, 5432) = 2267749475878240183 and \
             permute(:size-7, :size, 5432) = 1300324176838786780)

I don't understand any of this stuff at all, but I added a bunch of
printfs and worked out that the first point its local variables
diverge is here:

        /* Random offset */
        r = (uint64) getrand(&random_state2, 0, size - 1);

... after 4 earlier getrand() produced matching values.  Hmm.



Re: pgbench test failing on 14beta1 on Debian/i386

From
Thomas Munro
Date:
On Wed, May 19, 2021 at 11:34 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I don't understand any of this stuff at all, but I added a bunch of
> printfs and worked out that the first point its local variables
> diverge is here:
>
>         /* Random offset */
>         r = (uint64) getrand(&random_state2, 0, size - 1);

Forgot to post the actual values:

      r = 2563421694876090368
      r = 2563421694876090365

Smells a bit like a precision problem in the workings of pg_erand48(),
but as soon as I saw floating point numbers I closed my laptop and ran
for the door.



Re: pgbench test failing on 14beta1 on Debian/i386

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Forgot to post the actual values:
>       r = 2563421694876090368
>       r = 2563421694876090365
> Smells a bit like a precision problem in the workings of pg_erand48(),
> but as soon as I saw floating point numbers I closed my laptop and ran
> for the door.

Yup.  This test has a touching, but entirely unwarranted, faith in
pg_erand48() producing bit-for-bit the same values everywhere.

            regards, tom lane



Re: pgbench test failing on 14beta1 on Debian/i386

From
Fabien COELHO
Date:
>> Forgot to post the actual values:
>>       r = 2563421694876090368
>>       r = 2563421694876090365
>> Smells a bit like a precision problem in the workings of pg_erand48(),
>> but as soon as I saw floating point numbers I closed my laptop and ran
>> for the door.
>
> Yup.  This test has a touching, but entirely unwarranted, faith in
> pg_erand48() producing bit-for-bit the same values everywhere.

Indeed.

I argued against involving any floats computation on principle, but Dean 
was confident it could work, and it did simplify the code, so it did not 
look that bad an option.

I see two simple approaches:

(1) use another PRNG inside pgbench, eg Knuth's which was used in some 
previous submission and is very simple and IMHO better than the rand48 
stuff.

(2) extend pg_*rand48() to provide an unsigned 64 bits out of the 48 bits
state.

Any preference?

-- 
Fabien.



Re: pgbench test failing on 14beta1 on Debian/i386

From
Michael Paquier
Date:
On Wed, May 19, 2021 at 09:06:16AM +0200, Fabien COELHO wrote:
> I see two simple approaches:
>
> (1) use another PRNG inside pgbench, eg Knuth's which was used in some
> previous submission and is very simple and IMHO better than the rand48
> stuff.
>
> (2) extend pg_*rand48() to provide an unsigned 64 bits out of the 48 bits
> state.

Or, (3) remove this test?  I am not quite sure what there is to gain
with this extra test considering all the other tests with permute()
already present in this script.
--
Michael

Attachment

Re: pgbench test failing on 14beta1 on Debian/i386

From
Dean Rasheed
Date:
On Wed, 19 May 2021 at 00:35, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> FWIW this is reproducible on my local Debian/gcc box with -m32,

Confirmed, thanks for looking. I can reproduce it on my machine with
-m32. It's somewhat annoying that the buildfarm didn't pick it up
sooner :-(

On Wed, 19 May 2021 at 08:28, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, May 19, 2021 at 09:06:16AM +0200, Fabien COELHO wrote:
> > I see two simple approaches:
> >
> > (1) use another PRNG inside pgbench, eg Knuth's which was used in some
> > previous submission and is very simple and IMHO better than the rand48
> > stuff.
> >
> > (2) extend pg_*rand48() to provide an unsigned 64 bits out of the 48 bits
> > state.
>
> Or, (3) remove this test?  I am not quite sure what there is to gain
> with this extra test considering all the other tests with permute()
> already present in this script.

Yes, I think removing the test is the best option. It was originally
added because there was a separate code path for larger permutation
sizes that needed testing, but that's no longer the case so the test
really isn't adding anything.

Regards,
Dean



Re: pgbench test failing on 14beta1 on Debian/i386

From
Fabien COELHO
Date:

> Confirmed, thanks for looking. I can reproduce it on my machine with
> -m32. It's somewhat annoying that the buildfarm didn't pick it up
> sooner :-(
>
> On Wed, 19 May 2021 at 08:28, Michael Paquier <michael@paquier.xyz> wrote:
>>
>> On Wed, May 19, 2021 at 09:06:16AM +0200, Fabien COELHO wrote:
>>> I see two simple approaches:
>>>
>>> (1) use another PRNG inside pgbench, eg Knuth's which was used in some
>>> previous submission and is very simple and IMHO better than the rand48
>>> stuff.
>>>
>>> (2) extend pg_*rand48() to provide an unsigned 64 bits out of the 48 bits
>>> state.
>>
>> Or, (3) remove this test?  I am not quite sure what there is to gain
>> with this extra test considering all the other tests with permute()
>> already present in this script.
>
> Yes, I think removing the test is the best option. It was originally
> added because there was a separate code path for larger permutation
> sizes that needed testing, but that's no longer the case so the test
> really isn't adding anything.

Hmmm…

It is the one test which worked in actually detecting an issue, so I would 
not say that it is not adding anything, on the contrary, it did prove its 
value! The permute function is expected to be deterministic on different 
platforms and architectures, and it is not.

I agree that removing the test will hide the issue effectively:-) but 
ISTM more appropriate to solve the underlying issue and keep the test.

I'd agree with a two phases approach: drop the test in the short term and 
deal with the PRNG later. I'm sooooo unhappy with this 48 bit PRNG that I 
may be motivated enough to attempt to replace it, or at least add a better 
(faster?? larger state?? same/better quality?) alternative.

-- 
Fabien.

Re: pgbench test failing on 14beta1 on Debian/i386

From
Fabien COELHO
Date:
Hello Dean,

>>> Or, (3) remove this test?  I am not quite sure what there is to gain
>>> with this extra test considering all the other tests with permute()
>>> already present in this script.
>> 
>> Yes, I think removing the test is the best option. It was originally
>> added because there was a separate code path for larger permutation
>> sizes that needed testing, but that's no longer the case so the test
>> really isn't adding anything.
>
> Hmmm…
>
> It is the one test which worked in actually detecting an issue, so I would 
> not say that it is not adding anything, on the contrary, it did prove its 
> value! The permute function is expected to be deterministic on different 
> platforms and architectures, and it is not.
>
> I agree that removing the test will hide the issue effectively:-) but ISTM 
> more appropriate to solve the underlying issue and keep the test.
>
> I'd agree with a two phases approach: drop the test in the short term and 
> deal with the PRNG later. I'm sooooo unhappy with this 48 bit PRNG that I may 
> be motivated enough to attempt to replace it, or at least add a better 
> (faster?? larger state?? same/better quality?) alternative.

Attached patch disactivates the test with comments to outline that there 
is an issue to fix… so it is *not* removed.

I'm obviously okay with providing an alternate PRNG, let me know if this 
is the prefered option.

-- 
Fabien.
Attachment

Re: pgbench test failing on 14beta1 on Debian/i386

From
Dean Rasheed
Date:
On Wed, 19 May 2021 at 11:32, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> >> Or, (3) remove this test?  I am not quite sure what there is to gain
> >> with this extra test considering all the other tests with permute()
> >> already present in this script.
> >
> > Yes, I think removing the test is the best option. It was originally
> > added because there was a separate code path for larger permutation
> > sizes that needed testing, but that's no longer the case so the test
> > really isn't adding anything.
>
> Hmmm…
>
> It is the one test which worked in actually detecting an issue, so I would
> not say that it is not adding anything, on the contrary, it did prove its
> value! The permute function is expected to be deterministic on different
> platforms and architectures, and it is not.
>

In fact what it demonstrates is that the results from permute(), like
all the other pgbench random functions, will vary by platform for
sufficiently large size parameters.

> I'd agree with a two phases approach: drop the test in the short term and
> deal with the PRNG later. I'm sooooo unhappy with this 48 bit PRNG that I
> may be motivated enough to attempt to replace it, or at least add a better
> (faster?? larger state?? same/better quality?) alternative.
>

I don't necessarily have a problem with that provided the replacement
is well-chosen and has a proven track record (i.e., let's not invent
our own PRNG).

For now though, I'll go remove the test.

Regards,
Dean



Re: pgbench test failing on 14beta1 on Debian/i386

From
Dean Rasheed
Date:
On Wed, 19 May 2021 at 12:07, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Attached patch disactivates the test with comments to outline that there
> is an issue to fix… so it is *not* removed.
>

I opted to just remove the test rather than comment it out, since the
issue highlighted isn't specific to permute(). Also changing the PRNG
will completely change the results, so all the test values would
require rewriting, rather than it just being a case of uncommenting
the test and expecting it to work.

> I'm obviously okay with providing an alternate PRNG, let me know if this
> is the prefered option.
>

That's something for consideration in v15. If we do decide we want a
new PRNG, it should apply across the board to all pgbench random
functions.

Regards,
Dean



Re: pgbench test failing on 14beta1 on Debian/i386

From
Fabien COELHO
Date:
>>>> Or, (3) remove this test?  I am not quite sure what there is to gain
>>>> with this extra test considering all the other tests with permute()
>>>> already present in this script.
>>>
>>> Yes, I think removing the test is the best option. It was originally
>>> added because there was a separate code path for larger permutation
>>> sizes that needed testing, but that's no longer the case so the test
>>> really isn't adding anything.
>>
>> Hmmm…
>>
>> It is the one test which worked in actually detecting an issue, so I would
>> not say that it is not adding anything, on the contrary, it did prove its
>> value! The permute function is expected to be deterministic on different
>> platforms and architectures, and it is not.
>>
>
> In fact what it demonstrates is that the results from permute(), like
> all the other pgbench random functions, will vary by platform for
> sufficiently large size parameters.

Indeed, it is the case if the underlying math use doubles & large numbers. 
For integer-only computations it should be safe though, and permute should 
be in this category.

>> I'd agree with a two phases approach: drop the test in the short term and
>> deal with the PRNG later. I'm sooooo unhappy with this 48 bit PRNG that I
>> may be motivated enough to attempt to replace it, or at least add a better
>> (faster?? larger state?? same/better quality?) alternative.
>
> I don't necessarily have a problem with that provided the replacement
> is well-chosen and has a proven track record (i.e., let's not invent
> our own PRNG).

Yes, obviously, I'm not daft enough to reinvent a PRNG. The question is to 
chose one, motivate the choice, and build the relevant API for what pg 
needs, possibly with some benchmarking.

> For now though, I'll go remove the test.

This also removes the reminder…

-- 
Fabien.

Re: pgbench test failing on 14beta1 on Debian/i386

From
Andrew Dunstan
Date:
On 5/19/21 6:32 AM, Fabien COELHO wrote:
>
>
>> Confirmed, thanks for looking. I can reproduce it on my machine with
>> -m32. It's somewhat annoying that the buildfarm didn't pick it up
>> sooner :-(
>>
>> On Wed, 19 May 2021 at 08:28, Michael Paquier <michael@paquier.xyz>
>> wrote:
>>>
>>> On Wed, May 19, 2021 at 09:06:16AM +0200, Fabien COELHO wrote:
>>>> I see two simple approaches:
>>>>
>>>> (1) use another PRNG inside pgbench, eg Knuth's which was used in some
>>>> previous submission and is very simple and IMHO better than the rand48
>>>> stuff.
>>>>
>>>> (2) extend pg_*rand48() to provide an unsigned 64 bits out of the
>>>> 48 bits
>>>> state.
>>>
>>> Or, (3) remove this test?  I am not quite sure what there is to gain
>>> with this extra test considering all the other tests with permute()
>>> already present in this script.
>>
>> Yes, I think removing the test is the best option. It was originally
>> added because there was a separate code path for larger permutation
>> sizes that needed testing, but that's no longer the case so the test
>> really isn't adding anything.
>
> Hmmm…
>
> It is the one test which worked in actually detecting an issue, so I
> would not say that it is not adding anything, on the contrary, it did
> prove its value! The permute function is expected to be deterministic
> on different platforms and architectures, and it is not.
>
> I agree that removing the test will hide the issue effectively:-) but
> ISTM more appropriate to solve the underlying issue and keep the test.
>
> I'd agree with a two phases approach: drop the test in the short term
> and deal with the PRNG later. I'm sooooo unhappy with this 48 bit PRNG
> that I may be motivated enough to attempt to replace it, or at least
> add a better (faster?? larger state?? same/better quality?) alternative.
>

Yeah, this does seem to be something that should be fixed rather than
hidden.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: pgbench test failing on 14beta1 on Debian/i386

From
Christoph Berg
Date:
In the meantime postgresql-14 has been accepted into Debian/experimental:

https://buildd.debian.org/status/logs.php?pkg=postgresql-14&ver=14%7Ebeta1-1

Interestingly, the test is only failing on i386 and none of the other
architectures, which could hint at 80-bit extended precision FP
problems.

(The sparc64 error there is something else, I'll try rerunning it.
command failed: "psql" -X -c "CREATE DATABASE \"isolation_regression\" TEMPLATE=template0" "postgres")

Christoph



Re: pgbench test failing on 14beta1 on Debian/i386

From
Tom Lane
Date:
Christoph Berg <myon@debian.org> writes:
> Interestingly, the test is only failing on i386 and none of the other
> architectures, which could hint at 80-bit extended precision FP
> problems.

Yeah, that's what I'd assumed it is.  We suppress that where we can
with -fexcess-precision=standard or -msse2, but I'm guessing that
doesn't help here for some reason.

            regards, tom lane