Thread: [HACKERS] tap tests on older branches fail if concurrency is used

[HACKERS] tap tests on older branches fail if concurrency is used

From
Andres Freund
Date:
Hi,

when using
$ cat ~/.proverc
-j9

some tests fail for me in 9.4 and 9.5.  E.g. the tests in src/bin/scripts
yield a lot of fun like:
$ (cd ~/build/postgres/9.5-assert/vpath/src/bin/scripts/ && make check)
...
# LOG:  received immediate shutdown request
# WARNING:  terminating connection because of crash of another server process
# DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
# HINT:  In a moment you should be able to reconnect to the database and repeat your command.
...

it appears as if various tests are trampling over each other.  If needed
I can provide detailed logs, but it appears to readily reproduce on
several machines...

See Michael, I'll provide the details and a reproducer ;)

Greetings,

Andres Freund



Re: [HACKERS] tap tests on older branches fail if concurrency is used

From
Craig Ringer
Date:
On 1 June 2017 at 08:15, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> when using
> $ cat ~/.proverc
> -j9
>
> some tests fail for me in 9.4 and 9.5.  E.g. the tests in src/bin/scripts
> yield a lot of fun like:
> $ (cd ~/build/postgres/9.5-assert/vpath/src/bin/scripts/ && make check)
> ...
> # LOG:  received immediate shutdown request
> # WARNING:  terminating connection because of crash of another server process
> # DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> # HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> ...
>
> it appears as if various tests are trampling over each other.  If needed
> I can provide detailed logs, but it appears to readily reproduce on
> several machines...

I'll take a look at what's changed and why it's happening and get back to you.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] tap tests on older branches fail if concurrency is used

From
Craig Ringer
Date:
On 1 June 2017 at 08:15, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> when using
> $ cat ~/.proverc
> -j9
>
> some tests fail for me in 9.4 and 9.5.  E.g. the tests in src/bin/scripts
> yield a lot of fun like:
> $ (cd ~/build/postgres/9.5-assert/vpath/src/bin/scripts/ && make check)
> ...
> # LOG:  received immediate shutdown request
> # WARNING:  terminating connection because of crash of another server process
> # DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> # HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> ...
>
> it appears as if various tests are trampling over each other.

None of those scripts use PostgresNode, which I thought was added in
9.5, but apparently was actually introduced in 9.6. They do all their
own setup/teardown using TestLib.pm routines. TestLib uses a unique
tempdir for each test run, sets it as the unix socket directory, and
disables listening on tcp, so the most obvious conflict is hidden.

The immediate problem appears to be that they all use
tmp_check/postmaster.log . So anything that examines the logs gets
confused by seeing some other postgres instance's logs, or a mixture,
trampling everywhere.
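
As a rough illustration of that collision, here is a minimal sketch that starts no server at all: two stand-in "tests" append to one shared file, and anything reading the file afterwards sees both. The function name, the messages, and the temporary directory are hypothetical; only the file name postmaster.log comes from the description above.

```shell
# Hypothetical stand-in for tmp_check/postmaster.log; no real server is started.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
shared_log="$demo_dir/postmaster.log"

# Each fake "test script" appends its own message to the same file.
run_fake_test() {
    echo "LOG:  message from test $1" >> "$shared_log"
}

# Run both fake tests concurrently, as prove -j would.
run_fake_test A &
run_fake_test B &
wait

# Test A expected to find only its own output, but the file holds both.
total_lines=$(wc -l < "$shared_log" | tr -d ' ')
foreign_lines=$(grep -c 'from test B' "$shared_log")
echo "lines in log: $total_lines, lines from the other test: $foreign_lines"
```

A test that greps this file for an expected message cannot tell which instance wrote it.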

I'll be surprised if there aren't other problems though. Rather than
trying to fix it all up, this seems like a good argument for
backporting the updated suite from 9.6 or pg10, with PostgresNode etc.
I already have a working tree with that done to use src/test/recovery
in 9.5, but haven't updated src/bin/scripts etc yet.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] tap tests on older branches fail if concurrency is used

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> when using
> $ cat ~/.proverc
> -j9
> some tests fail for me in 9.4 and 9.5.

Weren't there fixes specifically intended to make that safe, awhile ago?
        regards, tom lane



Re: [HACKERS] tap tests on older branches fail if concurrency is used

From
Michael Paquier
Date:
On Wed, May 31, 2017 at 8:45 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 1 June 2017 at 08:15, Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> when using
>> $ cat ~/.proverc
>> -j9
>>
>> some tests fail for me in 9.4 and 9.5.  E.g. the tests in src/bin/scripts
>> yield a lot of fun like:
>> $ (cd ~/build/postgres/9.5-assert/vpath/src/bin/scripts/ && make check)
>> ...
>> # LOG:  received immediate shutdown request
>> # WARNING:  terminating connection because of crash of another server process
>> # DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> # HINT:  In a moment you should be able to reconnect to the database and repeat your command.
>> ...
>>
>> it appears as if various tests are trampling over each other.

They are. The problem is easily reproduced on my side with:
PROVE_FLAGS="-j 9" make check
It would be nice to get a minimum of stability for those tests in
back-branches even if PostgresNode.pm is not back-patched.

> The immediate problem appears to be that they all use
> tmp_check/postmaster.log . So anything that examines the logs gets
> confused by seeing some other postgres instance's logs, or a mixture,
> trampling everywhere.

Amen.

> I'll be surprised if there aren't other problems though. Rather than
> trying to fix it all up, this seems like a good argument for
> backporting the updated suite from 9.6 or pg10, with PostgresNode etc.
> I already have a working tree with that done to use src/test/recovery
> in 9.5, but haven't updated src/bin/scripts etc yet.

Yup. Even if PostgresNode.pm is not back-patched, a small trick is to
append the PID of the process running the TAP test to the log file
name as in the patch attached. This gives enough uniqueness for the
tests to pass with a high parallel degree.
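
The idea can be sketched in shell, where $$ expands to the PID of the running shell just as it does in Perl. This is only an illustration of the naming scheme, not the attached patch: the helper log_name_for_pid and the exact postmaster_<pid>.log pattern are assumptions.

```shell
# Build a log file name that embeds a process ID (hypothetical naming pattern).
log_name_for_pid() {
    printf 'postmaster_%s.log' "$1"
}

# $$ is the PID of the shell running this script, so two scripts that are
# running at the same time get different file names.
log_file="tmp_check/$(log_name_for_pid "$$")"
echo "this run would log to: $log_file"
```

The name is only unique among processes alive at the same time, which is all a parallel prove run needs.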

A second error that I have spotted is in the tests of pg_rewind, which
would fail in parallel as the same data folders are used for each
test. Using the same trick with $$ makes the tests more stable.
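
A small sketch of why PID-suffixed data folders stop colliding, assuming a hypothetical data_<pid> naming scheme (the real folder names in the pg_rewind tests are not shown here). Three concurrent child processes each create their own directory; mkdir without -p would fail if two of them picked the same name.

```shell
base=$(mktemp -d)
trap 'rm -rf "$base"' EXIT

for i in 1 2 3; do
    # sh -c starts a real child process, so $$ inside the quotes is that
    # child's PID. (In a plain "( ... )" subshell, $$ would still be the
    # parent's PID.) The sleep keeps the three children alive together.
    BASE="$base" sh -c 'mkdir "$BASE/data_$$" && sleep 1' &
done
wait

dir_count=$(ls "$base" | wc -l | tr -d ' ')
echo "data directories created: $dir_count"
```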

A third error is a failure in contrib/test_decoding, and this has been
addressed by Andres in 60f826c.

Attached is a patch for the first two, which makes the tests more
robust. I am myself annoyed by parallel tests failing when working on
patches for back-branches, so having at least a minimal fix would be
nice.
-- 
Michael


Re: [HACKERS] tap tests on older branches fail if concurrency is used

From
Michael Paquier
Date:
On Thu, Jun 1, 2017 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> when using
>> $ cat ~/.proverc
>> -j9
>> some tests fail for me in 9.4 and 9.5.
>
> Weren't there fixes specifically intended to make that safe, awhile ago?

60f826c has not been back-patched. While this would fix parallel runs
with make's --jobs, PROVE_FLAGS="-j X" would still fail.
-- 
Michael



Re: [HACKERS] tap tests on older branches fail if concurrency is used

From
Craig Ringer
Date:
On 7 June 2017 at 13:39, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Jun 1, 2017 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> when using
>>> $ cat ~/.proverc
>>> -j9
>>> some tests fail for me in 9.4 and 9.5.
>>
>> Weren't there fixes specifically intended to make that safe, awhile ago?
>
> 60f826c has not been back-patched. While this would fix parallel runs
> with make's --jobs, PROVE_FLAGS="-j X" would still fail.

Ah, that's why I didn't find it.

I think applying Michael's patch makes sense now, and if we decide to
backpatch PostgresNode (and I get the time to do it) we can clobber
that fix quite happily with the full backport. Thanks Michael for the
workaround.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services