Re: Logical replication stops dropping used initial-sync replication slots - Mailing list pgsql-bugs

From: Amit Kapila
Subject: Re: Logical replication stops dropping used initial-sync replication slots
Date:
Msg-id: CAA4eK1KbBAPfLL86x0nTz8f3dKKgETv9Udx3Ex8yKBArx+u1FA@mail.gmail.com
In response to: Re: Logical replication stops dropping used initial-sync replication slots (hubert depesz lubaczewski <depesz@depesz.com>)
Responses: Re: Logical replication stops dropping used initial-sync replication slots (hubert depesz lubaczewski <depesz@depesz.com>)
List: pgsql-bugs
On Fri, Mar 25, 2022 at 2:09 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Tue, Mar 22, 2022 at 12:26:46PM +0100, hubert depesz lubaczewski wrote:
> > On Tue, Mar 22, 2022 at 12:32:41PM +0530, vignesh C wrote:
> > > > That wouldn't do me any good. The server I'm trying to upgrade now is a
> > > > test bench to make sure that I can upgrade using logical replication *at
> > > > all*.
> > > > Once I have verified that I can, I have a couple hundred servers to
> > > > upgrade, which will have concurrent access all the time.
> > > Is it possible to get a reproducible test or a script for this problem?
> >
> > Well, given that we're talking about a 38GB database with 60k tables,
> > I find that unlikely.
> >
> > But I'll see whether I can figure out a way to replicate the problem with
> > something that is shareable.
>
> OK. I don't think it will be simple/possible to replicate.
>
> Reason:
>
> I set up 10 separate environments, all with the same database, on bionic
> pg 12.9.
>
> Then I created a publication with all tables in specific schemas, and
> subscribed to it from separate focal pg 14.2 servers.
>
> In total, I used 20 servers to replicate the test 10 times.
>
> Effect: in one case replication didn't finish the initial sync.
>
> Currently, it's 2022-03-25 08:32:14.10081+00, and on the one source whose
> subscription had failed, I see:
>                   slot_name                  |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn
> ---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+---------------------
>  focal14                                     | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    161822728 | 8E9/3E70BB58 | 8E9/3E70BB58
>  pg_3548429_sync_3343043_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32385 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
>  pg_3548429_sync_3079712_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32378 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
>  pg_3548429_sync_3145162_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32386 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
>  pg_3548429_sync_2993718_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32387 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
>  pg_3548429_sync_3355696_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32379 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
>  pg_3548429_sync_3099460_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32380 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
>  pg_3548429_sync_3162651_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32381 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
>  pg_3548429_sync_3157454_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32382 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
>  pg_3548429_sync_3084198_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32383 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
>  pg_3548429_sync_3142573_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32384 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
> (11 rows)
>
> pg_stat_replication shows 10 sync connections, and that's it. Their sent_lsn,
> write_lsn, flush_lsn, and replay_lsn seem to be progressing nicely, but nothing
> else is changing.
>
> pg_stat_activity contains:
>
> =# select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
>           xact_start           │          query_start          │         backend_type         │ query
> ═══════════════════════════════╪═══════════════════════════════╪══════════════════════════════╪════════════════════════════════════════════════════════════
>  [null]                        │ [null]                        │ autovacuum launcher          │
>  [null]                        │ 2022-03-24 20:45:30.757582+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_2993718_7077521990523695096" LOGICAL 8E9/40FF0170 (proto_version '1', publication_names '"focal14"')
>  [null]                        │ [null]                        │ pglogical supervisor         │
>  [null]                        │ [null]                        │ logical replication launcher │
>  [null]                        │ 2022-03-24 20:45:30.730818+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3343043_7077521990523695096" LOGICAL 8E9/40FF0100 (proto_version '1', publication_names '"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.65015+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3079712_7077521990523695096" LOGICAL 8E9/40FEFF60 (proto_version '1', publication_names '"focal14"')
>  [null]                        │ 2022-03-24 20:45:30.733366+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3145162_7077521990523695096" LOGICAL 8E9/40FF0138 (proto_version '1', publication_names '"focal14"')
>  [null]                        │ 2022-03-24 20:45:30.76285+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3355696_7077521990523695096" LOGICAL 8E9/40FEFF98 (proto_version '1', publication_names '"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.663299+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3099460_7077521990523695096" LOGICAL 8E9/40FEFFD0 (proto_version '1', publication_names '"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.676412+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3162651_7077521990523695096" LOGICAL 8E9/40FF0020 (proto_version '1', publication_names '"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.691807+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3157454_7077521990523695096" LOGICAL 8E9/40FF0058 (proto_version '1', publication_names '"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.69548+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3084198_7077521990523695096" LOGICAL 8E9/40FF0090 (proto_version '1', publication_names '"focal14"')
>  [null]                        │ 2022-03-24 20:45:30.974833+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3142573_7077521990523695096" LOGICAL 8E9/40FF00C8 (proto_version '1', publication_names '"focal14"')
>  2022-03-25 08:35:27.971161+00 │ 2022-03-25 08:35:27.971161+00 │ client backend               │ select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
>  [null]                        │ [null]                        │ background writer            │
>  [null]                        │ [null]                        │ checkpointer                 │
>  [null]                        │ [null]                        │ walwriter                    │
> (17 rows)
>
> So, it looks to me like there is some random-ish condition that causes
> the problem.
>
> I tried to insulate the pgs from the world as well as I could: no
> monitoring, no cron, no external access. Just replication and my simple
> load generator, running this loop:
>
> CREATE TABLE depesz_test as SELECT i, repeat('payload', 50) as payload FROM generate_series(1,100000) i;
> CREATE INDEX depesz_test_idx on depesz_test(i);
> DROP TABLE depesz_test;
>
> every minute.
>
> I'm out of ideas. The good-ish news is that when I tried replication in
> smaller sets (30k tables, not 60k), and/or with just
> 2 max_sync_workers_per_subscription instead of 10, it never failed. So
> I guess I'll go that way.
>

What was your setting for max_logical_replication_workers? It should
be more than max_sync_workers_per_subscription, as the sync workers
are taken from the pool of max_logical_replication_workers. One
possible reason for some inactive slots could be that some tablesync
workers exited without finishing the initial sync, and in the
meantime new tablesync workers started while the old ones never got
the time to finish. Eventually, you will get the error "Increase
max_replication_slots and try again", because we also restrict the
number of replication origins to max_replication_slots, and we create
one origin for each table copy.
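
For reference, a rough way to cross-check those limits and the leftover
sync machinery might look like the following (a sketch against the
standard catalogs; the LIKE pattern is only illustrative, based on the
slot names you showed above).

On the subscriber (focal 14.2):

    SHOW max_logical_replication_workers;
    SHOW max_sync_workers_per_subscription;
    SHOW max_replication_slots;
    -- one origin per table copy; their number is limited by max_replication_slots
    SELECT count(*) FROM pg_replication_origin;

On the publisher (bionic 12.9):

    -- initial-sync slots that were left behind
    SELECT slot_name, active, active_pid, restart_lsn
      FROM pg_replication_slots
     WHERE slot_name LIKE 'pg\_%\_sync\_%';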

It is good if you are able to proceed, but we might get some clue
about the problem you are seeing if you can share the subscriber-side
logs.
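
If it helps while gathering those logs, a rough summary of the
table-sync state on the subscriber could be taken with something like
this (srsubstate: i = initialize, d = data copy, s = synchronized,
r = ready):

    SELECT srsubstate, count(*)
      FROM pg_subscription_rel
     GROUP BY srsubstate;

    SELECT subname, pid, relid, received_lsn, last_msg_receipt_time
      FROM pg_stat_subscription;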

--
With Regards,
Amit Kapila.


