Re: Intermittent Issue with WAL Segment Removal in Logical Replication - Mailing list pgsql-general

From Kaushik Iska
Subject Re: Intermittent Issue with WAL Segment Removal in Logical Replication
Date
Msg-id CAHYLuV=_+_+OqXm=yuObjqPYfQhiBbTRJk9qFzhAte4TfNtV3Q@mail.gmail.com
Whole thread Raw
In response to Re: Intermittent Issue with WAL Segment Removal in Logical Replication  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-general
I am unfortunately not really familiar with Google Cloud SQL internals as well. But we have seen this happen on Amazon RDS as well.

Could it be possible that we are requesting a future WAL segment, say WAL upto X is written and we are asking for X + 1? It could be that the error message is misleading.

I do not have the information from pg_replication_slots as I have terminated the test. I am fairly certain that I can reproduce this again. I will gather both the restart_lsn and contents of pg_wal for the failed segment. Is there any other information that would help debug this further?

On Fri, Dec 29, 2023 at 4:16 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 12/27/23 16:31, Kaushik Iska wrote:
> Hi all,
>
> I'm including additional details, as I am able to reproduce this issue a
> little more reliably.
>
> Postgres Version: POSTGRES_14_9.R20230830.01_07
> Vendor: Google Cloud SQL
> Logical Replication Protocol version 1
>

I don't know much about Google Cloud SQL internals. Is it relatively
close to Postgres (as e.g. RDS) or are the internals very different /
modified for cloud environments?

> Here are the logs of attempt succeeding right after it fails:
>
> 2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
> STATEMENT:  START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> 6/5AE67D79 (proto_version '1', publication_names
> 'peerflow_pub_wal_testing_2') <- FAILS
> 2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres
> ERROR:  requested WAL segment 000000010000000600000059 has already been
> removed
> 2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
> STATEMENT:  START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> 6/5AE67D79 (proto_version '1', publication_names
> 'peerflow_pub_wal_testing_2')  <- SUCCEEDS
> 2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres
> LOG:  logical decoding found consistent point at 6/5A31F050
>
> Happy to include any additional details of my setup.
>

I personally don't see how could this fail and then succeed, unless
Google does something smart with the WAL segments under the hood. Surely
we try to open the same WAL segment (given the LSN is the same), so how
could it not exist and then exist?

As Ron already suggested, it might be useful to see information for the
replication slot peerflow_slot_wal_testing_2 (especially the restart_lsn
value). Also, maybe show the contents of pg_wal (especially for the
segment referenced in the error message).

Can you reproduce this outside Google cloud environment?


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-general by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Intermittent Issue with WAL Segment Removal in Logical Replication
Next
From: Tomas Vondra
Date:
Subject: Re: Intermittent Issue with WAL Segment Removal in Logical Replication