Logical replication is missing block of rows when sending initial sync? - Mailing list pgsql-bugs

From hubert depesz lubaczewski
Subject Logical replication is missing block of rows when sending initial sync?
Date
Msg-id ZTu8GTDajCkZVjMs@depesz.com
Whole thread Raw
Responses RE: Logical replication is missing block of rows when sending initial sync?
List pgsql-bugs
Hi,
we have situation where we use native logical replication to copy data
from pg12 (on ubuntu bionic) to pg14 (on ubuntu focal).

Generally it mostly works.

The process is like this:

1. make all tables in destination, with *just* primary keys. no other
   indexes, constraints
2. setup logical replication, adding tables in batches by 40, adding new
   batch to publication only when all tables are in 'r' state
3. once all tables are replicating, add indexes, fkeys, trgiggers and
   the like on recipient.

Every so often we got errors when adding fkeys. Usually I didn't have
much time to investigate, but today had a bit. Unfortunatley the replica
is gone (I had to re-set the replication). But when I had it there
I noticed that a block of rows from one of the tables was missing.

I am not entirely sure about the size of the block, but I check 20 rows
with consecutive ids, and they were all missing from focal side:

#v+
=# select ctid, xmin, id, created_at, updated_at from c33s16831.subs where id between 19733130 and 19733150 order by
1;
    ctid     │    xmin    │    id    │         created_at         │         updated_at
─────────────┼────────────┼──────────┼────────────────────────────┼────────────────────────────
 (486337,57) │ 2668461379 │ 19733130 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,58) │ 2668461379 │ 19733131 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,59) │ 2668461379 │ 19733132 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,60) │ 2668461379 │ 19733133 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,61) │ 2668461379 │ 19733134 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,62) │ 2668461379 │ 19733135 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,63) │ 2668461379 │ 19733136 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,64) │ 2668461379 │ 19733137 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,65) │ 2668461379 │ 19733138 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,66) │ 2668461379 │ 19733139 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,67) │ 2668461379 │ 19733140 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486340,38) │ 2668691763 │ 19733147 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:07.360778
 (486340,58) │ 2668695225 │ 19733146 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:15.227267
 (486340,61) │ 2668711141 │ 19733144 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:43.499749
 (486340,62) │ 2668717802 │ 19733149 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:49.900798
 (486340,63) │ 2668723876 │ 19733141 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:00.749445
 (486340,64) │ 2668753420 │ 19733142 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:02.140332
 (486340,66) │ 2668761678 │ 19733150 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:19.140231
 (486341,9)  │ 2668726256 │ 19733145 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:06.845051
 (486341,17) │ 2668748364 │ 19733148 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:50.246899
 (486341,18) │ 2668750613 │ 19733143 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:55.863125
(21 rows)
#v-

Last vacuum (from autovacuum) was ~ 3 weeks ago, so it isn't a factor.

I don't know when exactly this batch was added to publication, but I found that COPY that sent initial sync of the data
waslogged at:
 

#v+
2023-10-25 18:48:43.982 UTC,"upguser","dbname",12995,"10.1.207.7:44956",65396265.32c3,6,"COPY",2023-10-25 18:45:57
UTC,74/160580191,0,LOG,00000,"duration:166355.331 ms  statement: COPY c33s16831.subs TO
STDOUT",,,,,,,,,"pg_443910492_sync_443582288_7292150788357777097"
#v-

Unfortunately I don't have a way to repeat the problem. Just noticed
that on some dbs that we're upgrading the problem is more common, and on
some - it generally doesn't happen.

So, I don't have high hopes re: finding/fixing the issue, but figured
I'll let you know that there is such situation, perhaps it will strike
someone as something relating to something that could be the problem...

Best regards,

depesz




pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #18172: High memory usage in tSRF function context
Next
From: Richard Guo
Date:
Subject: Re: BUG #18170: Unexpected error: no relation entry for relid 3