RE: Logical replication is missing block of rows when sending initial sync? - Mailing list pgsql-bugs

From Hayato Kuroda (Fujitsu)
Subject RE: Logical replication is missing block of rows when sending initial sync?
Date
Msg-id TYAPR01MB5866C2BE11AD4FBDC0AB46F0F5A1A@TYAPR01MB5866.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Logical replication is missing block of rows when sending initial sync?  (hubert depesz lubaczewski <depesz@depesz.com>)
Responses Re: Logical replication is missing block of rows when sending initial sync?  (hubert depesz lubaczewski <depesz@depesz.com>)
Re: Logical replication is missing block of rows when sending initial sync?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-bugs
Dear depesz,

Hi, thanks for reporting.

> 1. make all tables in destination, with *just* primary keys. no other
>    indexes, constraints
> 2. setup logical replication, adding tables in batches by 40, adding new
>    batch to publication only when all tables are in 'r' state
> 3. once all tables are replicating, add indexes, fkeys, trgiggers and
>    the like on recipient.
>

I've tried, but I could not reproduce the failure. PSA the script what I did.
Could you please missing information like below?

* Table definitions on both nodes
* GUC settings
* Publication/subscription options
* Server log and output when you met the failure
* A minimal script for reproducing the failure (if you succeeded)
* Are there incoming data on publisher during the initial sync?
* A method how you checked pg_subscription_rel

> 
> Every so often we got errors when adding fkeys. Usually I didn't have
> much time to investigate, but today had a bit. Unfortunatley the replica
> is gone (I had to re-set the replication).

Could you share an error message what you met? I thought you saw like
"Key (XXX)=(YYY) is not present in table "ZZZ".", but please paste here.
If so, are tuples really missing at that time?

> But when I had it there
> I noticed that a block of rows from one of the tables was missing.
>
> I am not entirely sure about the size of the block, but I check 20 rows
> with consecutive ids, and they were all missing from focal side:
>...

(I was not sure how you defined tables subscriptions, but...)
Hmm, logical replication does not copy physical information like ctid, so I'm not
sure it is related with us. I felt that tuples between 19733141 and 19733150 were
updated by some way so that locations were changed.

FYI - Below workflow is a brief description for initial synchronization.

1. Tablesync worker connects to publisher
2. Starts a transaction on both pub/sub
3. Creates a replication slot on publisher
4. Executes "COPY ... TO STDOUT" on publisher. This means that all tuples are extracted
5. Does BeginCopyFrom() and CopyFrom(), they are same infrastacture as normall "COPY FROM" statement.
6. Commits the transaction on both pub/sub
7. Catches up changes since the replication slot was defined at 4.

> 
> I don't know when exactly this batch was added to publication, but I found that
> COPY that sent initial sync of the data was logged at:
> 
> #v+
> 2023-10-25 18:48:43.982
> UTC,"upguser","dbname",12995,"10.1.207.7:44956",65396265.32c3,6,"COPY",20
> 23-10-25 18:45:57 UTC,74/160580191,0,LOG,00000,"duration: 166355.331 ms
> statement: COPY c33s16831.subs TO
> STDOUT",,,,,,,,,"pg_443910492_sync_443582288_7292150788357777097"
> #v-

Yeah, I think this copy command was sent from tablesync worker.

> Unfortunately I don't have a way to repeat the problem. Just noticed
> that on some dbs that we're upgrading the problem is more common, and on
> some - it generally doesn't happen.

Just to confirm - You met the error only when the logical replication was set
up from old clusters to newer, right?

> So, I don't have high hopes re: finding/fixing the issue, but figured
> I'll let you know that there is such situation, perhaps it will strike
> someone as something relating to something that could be the problem...

That's great, I have never known the issue before.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED 


Attachment

pgsql-bugs by date:

Previous
From: Richard Guo
Date:
Subject: Re: BUG #18170: Unexpected error: no relation entry for relid 3
Next
From: Bruce Momjian
Date:
Subject: Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly