Re: backup manifests and contemporaneous buildfarm failures - Mailing list pgsql-hackers

From Andres Freund
Subject Re: backup manifests and contemporaneous buildfarm failures
Date
Msg-id 20200404030628.rlyvz2n76g7g26ea@alap3.anarazel.de
Whole thread Raw
In response to Re: backup manifests and contemporaneous buildfarm failures  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: backup manifests and contemporaneous buildfarm failures  (Petr Jelinek <petr@2ndquadrant.com>)
Re: backup manifests and contemporaneous buildfarm failures  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

Peter, Petr, CCed you because it's probably a bug somewhere around the
initial copy code for logical replication.


On 2020-04-03 20:48:09 -0400, Robert Haas wrote:
> 'serinus' is also failing. This is less obviously related:

Hm. Tests passed once since then.


> 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG:  received
> replication command: CREATE_REPLICATION_SLOT
> "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
> 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR:  replication
> slot "tap_sub_16390_sync_16384" already exists

That already seems suspicious. I checked the following (successful) run
and I did not see that in the stage's logs.

Looking at the failing log, it fails because for some reason there's
rounds (once due to a refresh, once due to an intention replication
failure) of copying the relation. Each creates its own temporary slot.

first time:
2020-04-04 02:08:57.276 CEST [5e87d019.506bd:1] LOG:  connection received: host=[local]
2020-04-04 02:08:57.278 CEST [5e87d019.506bd:4] LOG:  received replication command: CREATE_REPLICATION_SLOT
"tap_sub_16390_sync_16384"TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
 
2020-04-04 02:08:57.282 CEST [5e87d019.506bd:9] LOG:  statement: COPY public.tab_rep TO STDOUT
2020-04-04 02:08:57.284 CEST [5e87d019.506bd:10] LOG:  disconnection: session time: 0:00:00.007 user=bf
database=postgreshost=[local]
 

second time:
2020-04-04 02:08:57.288 CEST [5e87d019.506bf:1] LOG:  connection received: host=[local]
2020-04-04 02:08:57.289 CEST [5e87d019.506bf:4] LOG:  received replication command: CREATE_REPLICATION_SLOT
"tap_sub_16390_sync_16384"TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
 
2020-04-04 02:08:57.293 CEST [5e87d019.506bf:9] LOG:  statement: COPY public.tab_rep TO STDOUT

third time:
2020-04-04 02:08:57.297 CEST [5e87d019.506c1:1] LOG:  connection received: host=[local]
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG:  received replication command: CREATE_REPLICATION_SLOT
"tap_sub_16390_sync_16384"TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
 
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR:  replication slot "tap_sub_16390_sync_16384" already exists

Note that the connection from the second attempt has not yet
disconnected. Hence the error about the replication slot already
existing - it's a temporary replication slot that'd otherwise already
have been dropped.


Seems the logical rep code needs to do something about this race?


About the assertion failure:

TRAP: FailedAssertion("owner->bufferarr.nitems == 0", File:
"/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/../pgsql/src/backend/utils/resowner/resowner.c",Line: 718)
 
postgres: publisher: walsender bf [local] idle in transaction (aborted)(ExceptionalCondition+0x5c)[0x9a13ac]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(ResourceOwnerDelete+0x295)[0x9db8e5]
postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x54c61f]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(AbortOutOfAnyTransaction+0x122)[0x550e32]
postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x9b3bc9]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(shmem_exit+0x35)[0x80db45]
postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x80dc77]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(proc_exit+0x8)[0x80dd08]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(PostgresMain+0x59f)[0x83bd0f]
postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x7a0264]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(PostmasterMain+0xbfc)[0x7a2b8c]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(main+0x6fb)[0x49749b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7fc52d83bbbb]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(_start+0x2a)[0x49753a]
2020-04-04 02:08:57.302 CEST [5e87d018.5066b:4] LOG:  server process (PID 329409) was terminated by signal 6: Aborted

Due to the log_line_prefix used, I was at first not entirely sure the
backend that crashed was the one with the ERROR. But it appears we print
the pid as hex for '%c' (why?), so it indeed is the one.


I, again, have to say that the amount of stuff that was done as part of

commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
Author: Peter Eisentraut <peter_e@gmx.net>
Date:   2017-03-23 08:36:36 -0400

    Logical replication support for initial data copy

is insane. Adding support for running sql over replication connections
and extending CREATE_REPLICATION_SLOT with new options (without even
mentioning that in the commit message!) as part of a commit described as
"Logical replication support for initial data copy" shouldn't happen.


It's not obvious to me what buffer pins could be held at this point. I
wonder if this could be somehow related to

commit 3cb646264e8ced9f25557ce271284da512d92043
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2018-07-18 12:15:16 -0400

    Use a ResourceOwner to track buffer pins in all cases.
...
    In passing, remove some other ad-hoc resource owner creations that had
    gotten cargo-culted into various other places.  As far as I can tell
    that was all unnecessary, and if it had been necessary it was incomplete,
    due to lacking any provision for clearing those resowners later.
    (Also worth noting in this connection is that a process that hasn't called
    InitBufferPoolBackend has no business accessing buffers; so there's more
    to do than just add the resowner if we want to touch buffers in processes
    not covered by this patch.)

which removed the resowner previously used in walsender. At the very
least we should remove the SavedResourceOwnerDuringExport dance that's
still done in snapbuild.c.  But it can't really be at fault here,
because the crashing backend won't have used that.


So I'm a bit confused here. The best approach is probably to try to
reproduce this by adding an artifical delay into backend shutdown.


> (I still really dislike the fact that we have this evil hack allowing
> one connection to mix and match those sets of commands...)

FWIW, I think the opposite. We should get rid of the difference as much
as possible.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: backup manifests and contemporaneous buildfarm failures
Next
From: Laurenz Albe
Date:
Subject: Re: Add A Glossary