RE: Excessive number of replication slots for 12->14 logical replication - Mailing list pgsql-bugs

From houzj.fnst@fujitsu.com
Subject RE: Excessive number of replication slots for 12->14 logical replication
Date
Msg-id OS0PR01MB5716ED883D44E2214F3563B094959@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to Re: Excessive number of replication slots for 12->14 logical replication  (Ajin Cherian <itsajin@gmail.com>)
Responses Re: Excessive number of replication slots for 12->14 logical replication
List pgsql-bugs
On Sunday, July 24, 2022 4:17 PM Ajin Cherian <itsajin@gmail.com> wrote:
> On Sun, Jul 24, 2022 at 6:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 18, 2022 at 3:13 PM hubert depesz lubaczewski
> > <depesz@depesz.com> wrote:
> > >
> > > On Mon, Jul 18, 2022 at 09:07:35AM +0530, Amit Kapila wrote:
> > >
> > > First error:
> > > #v+
> > > 2022-07-18 09:22:07.046 UTC,,,4145917,,62d5263f.3f42fd,2,,2022-07-18
> > > 09:22:07 UTC,28/21641,1219146,ERROR,53400,"could not find free
> > > replication state slot for replication origin with OID
> > > 51",,"Increase max_replication_slots and try
> > > again.",,,,,,,"","logical replication worker",,0
> > > #v-
> > >
> > > Nothing else errored out before, no warning, no fatals.
> > >
> > > from the first ERROR I was getting them in the range of 40-70 per minute.
> > >
> > > At the same time I was logging data from `select now(), * from
> > > pg_replication_slots`, every 2 seconds.
> > >
> > ...
> > >
> > > So, it looks that there are up to 10 focal slots, all active, and then
> > > there are sync slots with weirdly high counts for inactive ones.
> > >
> > > At most, I had 11 active sync slots.
> > >
> > > Looks like some kind of timing issue, which would be inline with
> > > what Kyotaro Horiguchi wrote initially.
> > >
> >
> > I think this is a timing issue similar to what Horiguchi-San has
> > pointed out but due to replication origins. We drop the replication
> > origin after the sync worker that has used it is finished. This is
> > done by the apply worker because we don't allow to drop the origin
> > till the process owning the origin is alive. I am not sure of
> > repercussions but maybe we can allow dropping the origin by the
> > process that owns it.
> >
> 
> I have written a patch which will do the dropping of replication origins in the
> sync worker itself.
> I had to reset the origin session (which also resets the owned by
> flag) prior to the dropping of the slots.

Thanks for the patch.

I tried the patch and confirmed that we won't get the ERROR "could not find
free replication state slot for replication origin with OID" again after
applying the patch.

I tested the patch by making the apply worker wait a bit longer after setting
the state to SUBREL_STATE_CATCHUP. Without the patch, the table sync worker
exits before the apply worker drops the replorigin, and the apply worker then
tries to start another worker, which causes the ERROR.
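
To make the timing issue easier to picture, here is a toy model in C. It is
only a sketch and not PostgreSQL code: StateSlot, origin_acquire, origin_drop,
simulate_syncs, MAX_STATE_SLOTS and cleanup_lag are all hypothetical names
invented for illustration. It assumes each table sync takes one replication
state slot for its origin, and that (before the patch) the apply worker frees
that slot only some number of worker launches later.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: this is NOT PostgreSQL code; every name is hypothetical. */
#define MAX_STATE_SLOTS 4            /* stands in for max_replication_slots */

typedef struct {
    bool in_use;                     /* an origin currently holds this slot */
    int  origin_id;                  /* toy origin identifier */
} StateSlot;

/* Claim the first free state slot; returning false models the
 * "could not find free replication state slot" ERROR. */
static bool origin_acquire(StateSlot *slots, int origin_id)
{
    for (int i = 0; i < MAX_STATE_SLOTS; i++) {
        if (!slots[i].in_use) {
            slots[i].in_use = true;
            slots[i].origin_id = origin_id;
            return true;
        }
    }
    return false;
}

/* Release the slot held by origin_id; a missing origin is silently ignored. */
static void origin_drop(StateSlot *slots, int origin_id)
{
    for (int i = 0; i < MAX_STATE_SLOTS; i++) {
        if (slots[i].in_use && slots[i].origin_id == origin_id) {
            slots[i].in_use = false;
            return;
        }
    }
}

/*
 * Launch one sync worker per table and return how many launches failed.
 * drop_in_sync_worker = true  : the worker drops its own origin on exit.
 * drop_in_sync_worker = false : the apply worker drops it cleanup_lag
 *                               launches later (assumed delay).
 */
int simulate_syncs(int ntables, bool drop_in_sync_worker, int cleanup_lag)
{
    StateSlot slots[MAX_STATE_SLOTS] = {0};
    int failures = 0;

    assert(ntables >= 0 && cleanup_lag >= 0);

    for (int i = 0; i < ntables; i++) {
        if (!origin_acquire(slots, i))
            failures++;                      /* no free state slot */
        else if (drop_in_sync_worker)
            origin_drop(slots, i);           /* cleaned up by the worker */

        /* Old behaviour: the apply worker cleans up late. */
        if (!drop_in_sync_worker && i >= cleanup_lag)
            origin_drop(slots, i - cleanup_lag);
    }
    return failures;
}
```

In this model, simulate_syncs(10, false, 6) reports failed launches because
more origins are outstanding than there are state slots, while
simulate_syncs(10, true, 6) reports none. The real race depends on worker
scheduling, which this sketch does not try to reproduce.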

A few comments:

1)
-                 * There is a chance that the user is concurrently performing
-                 * refresh for the subscription where we remove the table
-                 * state and its origin and by this time the origin might be
-                 * already removed. So passing missing_ok = true.
-                 */

I think it would be better to move this comment to the new place where we drop
the replorigin.


2)

-                replorigin_drop_by_name(originname, true, false);
 
                 /*
                  * Update the state to READY only after the origin cleanup.

Do we need to slightly modify the comment here, since the origin drop code has
been moved elsewhere? Maybe: "It's safe to update the state to READY as the
origin should have been dropped by the table sync worker".

Best regards,
Hou zj
