Re: Logical replication stops dropping used initial-sync replication slots - Mailing list pgsql-bugs

From vignesh C
Subject Re: Logical replication stops dropping used initial-sync replication slots
Date
Msg-id CALDaNm1S-DtphU51d3orNknjyNQEjAOL90jXAxkis_n4rtS=eA@mail.gmail.com
Whole thread Raw
In response to Re: Logical replication stops dropping used initial-sync replication slots  (hubert depesz lubaczewski <depesz@depesz.com>)
Responses Re: Logical replication stops dropping used initial-sync replication slots  (hubert depesz lubaczewski <depesz@depesz.com>)
List pgsql-bugs
On Mon, Mar 21, 2022 at 8:13 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Mon, Mar 21, 2022 at 07:48:20PM +0530, vignesh C wrote:
> > > What can I do about it?
> >
> > >2022-03-17 13:32:48.119 UTC,,,2923553,,62333880.2c9c21,2,,2022-03-17 13:32:48
UTC,8/447946,23877409,ERROR,53400,"couldnot find free replication state slot for replication origin with OID
51",,"Increasemax_replication_slots and try again.",,,,,,,"","logical replication worker",,0 
> > Each Tablesync worker needs a replication slot to synchronize the data
> > to the subscriber, this error means there are not enough replication
> > slots. Try to increase max_replication_slots configuration if it is
> > set to lower value as you are using around 60k tables.
>
> yeah, but I don't need 60k replication slots?
>
> Please note that replication (initial sync) *IS* working happily for
> *some* time, and only then starts "eating" replication slots for
> whatever reason.
>
> Specifically - I have max_replication_slots set to 50 (per my original
> mail), and sync workers shouldn't exceed 10.
>
> > > 2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17
18:28:53UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions
(approximately1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT 
> >
> > Snapshot builder is waiting for the open transaction to finish. Is it
> > possible that there are some open transactions that were started
> > during this logical replication. I felt once the transaction
> > completes, snapshot building and tablesync operation should finish.
>
> There are no long open transacitons there. This is test box with *zero*
> traffic (aside from running in a loop: create table, drop table), but
> *NOT* in transaction.

I could simulate the similar problem by starting a new transaction
while creating a subscription and notice the following log content:
2022-03-22 08:00:17.868 IST [2680] LOG:  logical decoding found
initial starting point at 0/427DC3D0
2022-03-22 08:00:17.868 IST [2680] DETAIL:  Waiting for transactions
(approximately 1) older than 748 to end.
2022-03-22 08:00:17.868 IST [2680] STATEMENT:  CREATE_REPLICATION_SLOT
"pg_16439_sync_16403_7077745941166836171" LOGICAL pgoutput (SNAPSHOT
'use')

I could see that the following transaction was open(older than 748):
     5 | postgres | 2653 |            |       10 | vignesh | psql
                              |             |                 |
  -1 | 2022-03-22 07:59:46.147802+05:30 | 2022-03-22 08:00:17.394259+
05:30 | 2022-03-22 08:00:23.024178+05:30 | 2022-03-22
08:00:23.024181+05:30 |                 |                     | active
|         747 |          747 |          | select * from
pg_stat_activity ;
                                                        | client backend
     5 | postgres | 2665 |            |          |         |
                              |             |                 |
     | 2022-03-22 08:00:03.005628+05:30 | 2022-03-22 08:00:21.653243+
05:30 | 2022-03-22 08:00:21.653243+05:30 | 2022-03-22
08:00:21.653243+05:30 | Timeout         | VacuumDelay         | active
|             |          747 |          | autovacuum: VACUUM ANALYZE
sch1.t4
  | autovacuum worker

Similarly I could see it from pg_locks:
 transactionid |          |          |      |       |            |
      747 |         |       |          | 6/11               | 2682 |
ShareLock        | f       | f        | 2022-03-22
08:00:17.932021+05:3
0
 relation      |        0 |     1262 |      |       |            |
          |         |       |          | 3/41               | 2653 |
AccessShareLock  | t       | f        |
 transactionid |          |          |      |       |            |
      747 |         |       |          | 7/11               | 2680 |
ShareLock        | f       | f        | 2022-03-22
08:00:17.868615+05:3
0
 transactionid |          |          |      |       |            |
      747 |         |       |          | 3/41               | 2653 |
ExclusiveLock    | t       | f        |

I'm not sure if this is the same scenario or not.
Could you also confirm from pg_stat_activity and pg_locks if there are
any open transactions and if there is any open transaction, could you
see this transaction is on which relation and are these because of the
table sync workers?

Also as an alternative you might be able to use pg_dump to take a
backup and restore it to the destination if there are no transactions
and create subscription with copy_data as off.

Regards,
Vignesh



pgsql-bugs by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: BUG #17445: "ON CONFLICT" has different behaviors when its param is passed with prepared stmt or hard coded
Next
From: hubert depesz lubaczewski
Date:
Subject: Re: Logical replication stops dropping used initial-sync replication slots