Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers

From Tom Lane
Subject Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date
Msg-id 929224.1600046876@sss.pgh.pa.us
Whole thread Raw
In response to Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> Probably this requires a relcache inval at the wrong time;
> although we have recent passes from CLOBBER_CACHE_ALWAYS animals,
> so that can't be the whole triggering condition.  I wonder whether
> it is relevant that all of the complaining animals are JIT-enabled.

Hmmm ... I take that back.  hyrax has indeed passed since this went
in, but *it doesn't run any TAP tests*.  So the buildfarm offers no
information about whether the replication tests work under
CLOBBER_CACHE_ALWAYS.

Realizing that, I built an installation that way and tried to run
the subscription tests.  Results so far:

* Running 010_truncate.pl by itself passed for me.  So there's still
some unexplained factor needed to trigger the buildfarm failures.
(I'm wondering about concurrent autovacuum activity now...)

* Starting over, it appears that 001_rep_changes.pl almost immediately
gets into an infinite loop.  It does not complete the third test step,
rather infinitely waiting for progress to be made.  The publisher log
shows a repeating loop like

2020-09-13 21:16:05.734 EDT [928529] tap_sub LOG:  could not send data to client: Broken pipe
2020-09-13 21:16:05.734 EDT [928529] tap_sub CONTEXT:  slot "tap_sub", output plugin "pgoutput", in the commit
callback,associated LSN 0/1660628 
2020-09-13 21:16:05.843 EDT [928581] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 
2020-09-13 21:16:05.861 EDT [928582] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:16:05.929 EDT [928582] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG:  starting logical decoding for slot "tap_sub"
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL:  Streaming transactions committing after 0/1652820, reading WAL
from0/1651B20. 
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG:  logical decoding found consistent point at 0/1651B20
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL:  There are no running transactions.
2020-09-13 21:16:21.560 EDT [928600] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 
2020-09-13 21:16:37.291 EDT [928610] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 
2020-09-13 21:16:52.959 EDT [928627] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 
2020-09-13 21:17:06.866 EDT [928636] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 
2020-09-13 21:17:06.934 EDT [928636] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:07.811 EDT [928638] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:07.880 EDT [928638] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:07.881 EDT [928638] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 
2020-09-13 21:17:07.881 EDT [928638] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:08.618 EDT [928641] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 
2020-09-13 21:17:08.753 EDT [928642] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 
2020-09-13 21:17:08.821 EDT [928642] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.689 EDT [928645] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:09.756 EDT [928645] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:09.757 EDT [928645] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 
2020-09-13 21:17:09.757 EDT [928645] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.841 EDT [928582] tap_sub LOG:  could not send data to client: Broken pipe
2020-09-13 21:17:09.841 EDT [928582] tap_sub CONTEXT:  slot "tap_sub", output plugin "pgoutput", in the commit
callback,associated LSN 0/1660628 

while the subscriber is repeating

2020-09-13 21:15:01.598 EDT [928528] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:02.178 EDT [928528] ERROR:  terminating logical replication worker due to timeout
2020-09-13 21:16:02.179 EDT [920797] LOG:  background worker "logical replication worker" (PID 928528) exited with exit
code1 
2020-09-13 21:16:02.606 EDT [928571] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:03.117 EDT [928571] ERROR:  could not start WAL streaming: ERROR:  replication slot "tap_sub" is
activefor PID 928529 
2020-09-13 21:16:03.118 EDT [920797] LOG:  background worker "logical replication worker" (PID 928571) exited with exit
code1 
2020-09-13 21:16:03.544 EDT [928574] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.053 EDT [928574] ERROR:  could not start WAL streaming: ERROR:  replication slot "tap_sub" is
activefor PID 928529 
2020-09-13 21:16:04.054 EDT [920797] LOG:  background worker "logical replication worker" (PID 928574) exited with exit
code1 
2020-09-13 21:16:04.479 EDT [928576] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.990 EDT [928576] ERROR:  could not start WAL streaming: ERROR:  replication slot "tap_sub" is
activefor PID 928529 
2020-09-13 21:16:04.990 EDT [920797] LOG:  background worker "logical replication worker" (PID 928576) exited with exit
code1 
2020-09-13 21:16:05.415 EDT [928579] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:17:05.994 EDT [928579] ERROR:  terminating logical replication worker due to timeout

I'm out of patience to investigate this for tonight, but there is
something extremely broken here; maybe more than one something.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Gripes about walsender command processing
Next
From: Michael Paquier
Date:
Subject: Re: Avoid incorrect allocation in buildIndexArray