Re: Intermittent Issue with WAL Segment Removal in Logical Replication - Mailing list pgsql-general

From Tomas Vondra
Subject Re: Intermittent Issue with WAL Segment Removal in Logical Replication
Date
Msg-id 3b75e08a-2d53-fbb5-731a-3a8e5c71edd8@enterprisedb.com
In response to Re: Intermittent Issue with WAL Segment Removal in Logical Replication  (Kaushik Iska <kaushik@peerdb.io>)
Responses Re: Intermittent Issue with WAL Segment Removal in Logical Replication  (Kaushik Iska <kaushik@peerdb.io>)
List pgsql-general
On 12/27/23 16:31, Kaushik Iska wrote:
> Hi all,
> 
> I'm including additional details, as I am able to reproduce this issue a
> little more reliably.
> 
> Postgres Version: POSTGRES_14_9.R20230830.01_07
> Vendor: Google Cloud SQL
> Logical Replication Protocol version 1
> 

I don't know much about Google Cloud SQL internals. Is it relatively
close to Postgres (like e.g. RDS), or are the internals very different /
modified for cloud environments?

> Here are the logs of attempt succeeding right after it fails:
> 
> 2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
> STATEMENT:  START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> 6/5AE67D79 (proto_version '1', publication_names
> 'peerflow_pub_wal_testing_2') <- FAILS
> 2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres
> ERROR:  requested WAL segment 000000010000000600000059 has already been
> removed
> 2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
> STATEMENT:  START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> 6/5AE67D79 (proto_version '1', publication_names
> 'peerflow_pub_wal_testing_2')  <- SUCCEEDS
> 2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres
> LOG:  logical decoding found consistent point at 6/5A31F050
> 
> Happy to include any additional details of my setup.
> 

I personally don't see how this could fail and then succeed, unless
Google does something smart with the WAL segments under the hood. Surely
we try to open the same WAL segment (given the LSN is the same), so how
could it not exist and then exist?
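
For reference, here is a minimal sketch of how an LSN maps to a WAL
segment file name. It assumes the default 16MB segment size and timeline
1 (neither is confirmed for this instance), and the function names are
made up for illustration; this is not the server's own code.

```python
# Sketch only: maps an LSN such as "6/5AE67D79" to a WAL segment file name.
# Assumptions: default 16MB segments, timeline 1. Both can differ on a
# real instance (segment size is fixed at initdb time).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert the textual 'hi/lo' hex form into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_segment_name(lsn: str, timeline: int = 1,
                     seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the segment file that contains the given LSN."""
    segno = parse_lsn(lsn) // seg_size
    segs_per_xlogid = 0x100000000 // seg_size  # 256 with 16MB segments
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


if __name__ == "__main__":
    # LSNs taken from the log excerpt quoted above.
    for lsn in ("6/5AE67D79", "6/5A31F050"):
        print(lsn, "->", wal_segment_name(lsn))
```

Under those assumptions, both the requested LSN (6/5AE67D79) and the
consistent point (6/5A31F050) fall into segment 00000001000000060000005A,
while the error names the segment just before it. That is presumably
because decoding has to start reading from an earlier position kept by
the slot, which is why the slot's restart_lsn is of interest.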

As Ron already suggested, it might be useful to see information for the
replication slot peerflow_slot_wal_testing_2 (especially the restart_lsn
value). Also, maybe show the contents of pg_wal (especially for the
segment referenced in the error message).
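
A rough sketch of how those two pieces of information could be
cross-checked. The two SQL strings use the pg_replication_slots view and
the pg_ls_waldir() function; whether Cloud SQL lets you call the latter
is an assumption on my part. The helper names, the 16MB segment size and
timeline 1 are also assumptions, and the restart_lsn in the example is a
made-up placeholder, not a value from this system.

```python
# Sketch only: given a slot's restart_lsn and a listing of pg_wal, report
# which segments needed for decoding are absent.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size

# Queries to run on the publisher (shown as strings, not executed here).
SLOT_QUERY = (
    "SELECT slot_name, active, restart_lsn, confirmed_flush_lsn, wal_status "
    "FROM pg_replication_slots "
    "WHERE slot_name = 'peerflow_slot_wal_testing_2'"
)
WALDIR_QUERY = "SELECT name, size, modification FROM pg_ls_waldir() ORDER BY name"


def lsn_to_segno(lsn, seg_size=WAL_SEGMENT_SIZE):
    """Segment number containing the LSN given in 'hi/lo' hex form."""
    hi, lo = lsn.split("/")
    return ((int(hi, 16) << 32) | int(lo, 16)) // seg_size


def segment_name(segno, timeline=1, seg_size=WAL_SEGMENT_SIZE):
    """WAL file name for a segment number on the given timeline."""
    per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (timeline, segno // per_xlogid, segno % per_xlogid)


def missing_segments(restart_lsn, requested_lsn, wal_files,
                     timeline=1, seg_size=WAL_SEGMENT_SIZE):
    """Segments between restart_lsn and requested_lsn not present in wal_files."""
    present = set(wal_files)
    first = lsn_to_segno(restart_lsn, seg_size)
    last = lsn_to_segno(requested_lsn, seg_size)
    needed = (segment_name(n, timeline, seg_size) for n in range(first, last + 1))
    return [name for name in needed if name not in present]


if __name__ == "__main__":
    # Hypothetical inputs: restart_lsn and the directory listing are
    # placeholders; only the requested LSN comes from the quoted log.
    listing = ["00000001000000060000005A"]
    print(missing_segments("6/59000028", "6/5AE67D79", listing))
```

If the output of the two queries is captured at the moment of a failed
attempt and again after a successful one, comparing the two results
should show whether the segment really disappears and comes back.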

Can you reproduce this outside the Google Cloud environment?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


