Thread: Logical replication is missing block of rows when sending initial sync?

Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
Hi,
we have a situation where we use native logical replication to copy data
from pg12 (on ubuntu bionic) to pg14 (on ubuntu focal).

Generally it mostly works.

The process is like this:

1. make all tables in destination, with *just* primary keys. no other
   indexes, constraints
2. set up logical replication, adding tables in batches of 40, adding a new
   batch to the publication only when all tables are in 'r' state
3. once all tables are replicating, add indexes, fkeys, triggers and
   the like on the recipient.
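
For readers following along, the bare-bones commands look roughly like this (a minimal sketch; the publication/subscription name focal14 is the one shown later in the thread, while table names and the connection string are illustrative):

#v+
-- on the pg12 publisher:
CREATE PUBLICATION focal14;
ALTER PUBLICATION focal14 ADD TABLE ONLY some_schema.some_table;  -- repeated, 40 tables per batch

-- on the pg14 subscriber:
CREATE SUBSCRIPTION focal14
    CONNECTION 'host=... dbname=...'
    PUBLICATION focal14;
#v-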

Every so often we got errors when adding fkeys. Usually I didn't have
much time to investigate, but today I had a bit. Unfortunately the replica
is gone (I had to re-set the replication). But when I had it there
I noticed that a block of rows from one of the tables was missing.

I am not entirely sure about the size of the block, but I checked 20 rows
with consecutive ids, and they were all missing from the focal side:

#v+
=# select ctid, xmin, id, created_at, updated_at from c33s16831.subs where id between 19733130 and 19733150 order by 1;
    ctid     │    xmin    │    id    │         created_at         │         updated_at
─────────────┼────────────┼──────────┼────────────────────────────┼────────────────────────────
 (486337,57) │ 2668461379 │ 19733130 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,58) │ 2668461379 │ 19733131 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,59) │ 2668461379 │ 19733132 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,60) │ 2668461379 │ 19733133 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,61) │ 2668461379 │ 19733134 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,62) │ 2668461379 │ 19733135 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,63) │ 2668461379 │ 19733136 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,64) │ 2668461379 │ 19733137 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,65) │ 2668461379 │ 19733138 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,66) │ 2668461379 │ 19733139 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,67) │ 2668461379 │ 19733140 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486340,38) │ 2668691763 │ 19733147 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:07.360778
 (486340,58) │ 2668695225 │ 19733146 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:15.227267
 (486340,61) │ 2668711141 │ 19733144 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:43.499749
 (486340,62) │ 2668717802 │ 19733149 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:49.900798
 (486340,63) │ 2668723876 │ 19733141 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:00.749445
 (486340,64) │ 2668753420 │ 19733142 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:02.140332
 (486340,66) │ 2668761678 │ 19733150 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:19.140231
 (486341,9)  │ 2668726256 │ 19733145 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:06.845051
 (486341,17) │ 2668748364 │ 19733148 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:50.246899
 (486341,18) │ 2668750613 │ 19733143 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:55.863125
(21 rows)
#v-

Last vacuum (from autovacuum) was ~ 3 weeks ago, so it isn't a factor.

I don't know when exactly this batch was added to the publication, but I found
that the COPY that sent the initial sync of the data was logged at:

#v+
2023-10-25 18:48:43.982 UTC,"upguser","dbname",12995,"10.1.207.7:44956",65396265.32c3,6,"COPY",2023-10-25 18:45:57 UTC,74/160580191,0,LOG,00000,"duration: 166355.331 ms  statement: COPY c33s16831.subs TO STDOUT",,,,,,,,,"pg_443910492_sync_443582288_7292150788357777097"
#v-

Unfortunately I don't have a way to repeat the problem. I've just noticed
that on some dbs that we're upgrading the problem is more common, and on
some it generally doesn't happen.

So, I don't have high hopes re: finding/fixing the issue, but figured
I'd let you know that this situation exists; perhaps it will remind
someone of something that could be the problem...

Best regards,

depesz




RE: Logical replication is missing block of rows when sending initial sync?

From: "Hayato Kuroda (Fujitsu)"
Date: Mon, 30 Oct 2023 07:10:35 +0000
Dear depesz,

Hi, thanks for reporting.

> 1. make all tables in destination, with *just* primary keys. no other
>    indexes, constraints
> 2. set up logical replication, adding tables in batches of 40, adding a new
>    batch to the publication only when all tables are in 'r' state
> 3. once all tables are replicating, add indexes, fkeys, triggers and
>    the like on the recipient.
>

I've tried, but I could not reproduce the failure. PSA the script I used.
Could you please provide the missing information listed below?

* Table definitions on both nodes
* GUC settings
* Publication/subscription options
* Server log and output when you met the failure
* A minimal script for reproducing the failure (if you succeeded)
* Is there incoming data on the publisher during the initial sync?
* The method you used to check pg_subscription_rel

> 
> Every so often we got errors when adding fkeys. Usually I didn't have
> much time to investigate, but today I had a bit. Unfortunately the replica
> is gone (I had to re-set the replication).

Could you share the error message you got? I assume you saw something like
"Key (XXX)=(YYY) is not present in table "ZZZ".", but please paste it here.
If so, were the tuples really missing at that time?

> But when I had it there
> I noticed that a block of rows from one of the tables was missing.
>
> I am not entirely sure about the size of the block, but I check 20 rows
> with consecutive ids, and they were all missing from focal side:
>...

(I was not sure how you defined the tables and subscriptions, but...)
Hmm, logical replication does not copy physical information like ctid, so I'm not
sure that is relevant here. It looks like the tuples between 19733141 and 19733150 were
updated in some way, so their locations changed.
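
As a quick illustration of why ctid is not stable (a throwaway sketch; any UPDATE writes a new tuple version at a new location):

#v+
CREATE TABLE ctid_demo (id int PRIMARY KEY, v int);
INSERT INTO ctid_demo VALUES (1, 0);
SELECT ctid FROM ctid_demo;            -- (0,1)
UPDATE ctid_demo SET v = 1 WHERE id = 1;
SELECT ctid FROM ctid_demo;            -- now (0,2): the row "moved"
#v-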

FYI - Below workflow is a brief description of the initial synchronization.

1. Tablesync worker connects to the publisher
2. Starts a transaction on both pub/sub
3. Creates a replication slot on the publisher
4. Executes "COPY ... TO STDOUT" on the publisher. This means that all tuples are extracted
5. Does BeginCopyFrom() and CopyFrom(); these use the same infrastructure as a normal "COPY FROM" statement.
6. Commits the transaction on both pub/sub
7. Catches up on changes made since the replication slot was created in step 3.
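
To make that concrete, the tablesync worker sends roughly the following over its replication connection (a sketch; the slot name is taken from the log you quoted, the placeholder LSN and the exact options depend on the server versions):

#v+
BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT "pg_443910492_sync_443582288_7292150788357777097"
    LOGICAL pgoutput USE_SNAPSHOT;
COPY c33s16831.subs TO STDOUT;
COMMIT;
-- then, starting from the LSN returned by CREATE_REPLICATION_SLOT:
START_REPLICATION SLOT "pg_443910492_sync_443582288_7292150788357777097"
    LOGICAL 0/0 (proto_version '1', publication_names '"focal14"');
#v-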

> 
> I don't know when exactly this batch was added to publication, but I found that
> COPY that sent initial sync of the data was logged at:
> 
> #v+
> 2023-10-25 18:48:43.982
> UTC,"upguser","dbname",12995,"10.1.207.7:44956",65396265.32c3,6,"COPY",20
> 23-10-25 18:45:57 UTC,74/160580191,0,LOG,00000,"duration: 166355.331 ms
> statement: COPY c33s16831.subs TO
> STDOUT",,,,,,,,,"pg_443910492_sync_443582288_7292150788357777097"
> #v-

Yeah, I think this copy command was sent by the tablesync worker.

> Unfortunately I don't have a way to repeat the problem. Just noticed
> that on some dbs that we're upgrading the problem is more common, and on
> some - it generally doesn't happen.

Just to confirm - you met the error only when logical replication was set
up from an older cluster to a newer one, right?

> So, I don't have high hopes re: finding/fixing the issue, but figured
> I'll let you know that there is such situation, perhaps it will strike
> someone as something relating to something that could be the problem...

That's great - I had never heard of this issue before.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED 


Attachment: test_1030.sh

Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
On Mon, Oct 30, 2023 at 07:10:35AM +0000, Hayato Kuroda (Fujitsu) wrote:
> * Table definitions on both nodes

Well, the table has 50 columns, 28 indexes, 10 fkeys, is referenced back by 26 tables, and has two simple triggers.
Showing it all would be rather complicated and time-consuming (legal issues).

Used datatypes:

- 14 columns with bigint
-  8 columns with character varying(255)
-  7 columns with timestamp without time zone
-  6 columns with boolean
-  3 columns with text
-  3 columns with integer
-  3 columns with double precision
-  2 columns with character varying(16)
-  1 column with uuid
-  1 column with numeric(6,2)
-  1 column with character varying(5)
-  1 column with character varying

I know it's not a full \d, but I don't think sharing it in full is realistic.

Schema on the recipient is restored from a schema dump made ~ 5 minutes earlier.

> * GUC settings

Which GUCs?

> * Publication/subscription options

=# \dRp focal14
                           List of publications
  Name   │  Owner   │ All tables │ Inserts │ Updates │ Deletes │ Truncates
─────────┼──────────┼────────────┼─────────┼─────────┼─────────┼───────────
 focal14 │ postgres │ f          │ t       │ t       │ t       │ t
(1 row)

and

=# \dRs focal14
           List of subscriptions
  Name   │  Owner   │ Enabled │ Publication
─────────┼──────────┼─────────┼─────────────
 focal14 │ postgres │ t       │ {focal14}
(1 row)

> * Server log and output when you met the failure

The failure was just a failure to add an fkey. These logs are gone now; I can't
keep a failed host while I'm in the process of preparing boxes for the upgrade.

Error message was:

ALTER TABLE ONLY c33s16831.ag_change_records_2023_10 ADD CONSTRAINT fk_rails_0113b4b993 FOREIGN KEY (sub_id) REFERENCES
c33s16831.subs(id);
ERROR:  insert or update on table "ag_change_records_2023_10" violates foreign key constraint "fk_rails_0113b4b993"
DETAIL:  Key (sub_id)=(19733141) is not present in table "subs".

> * A minimal script for reproducing the failure (if you succeeded)

No. I think it's some kind of race condition.

> * Is there incoming data on the publisher during the initial sync?

Yes. Pg12 is working as a normal primary server for a cluster that has traffic 24h/day.

> * The method you used to check pg_subscription_rel

#v+
WITH info AS (
    SELECT
        format(
            '%s:%s',
            sr.srsubstate,
            count(*)
        ) AS part,
        min( o.ord ) AS ord
    FROM
        pg_subscription AS s
        JOIN pg_subscription_rel AS sr ON s.oid = sr.srsubid
        JOIN unnest( '{i,d,f,s,r}'::text[] ) WITH ORDINALITY AS o ( s, ord ) ON sr.srsubstate = o.s
    WHERE
        s.subname = 'focal14'
    GROUP BY
        sr.srsubstate
)
SELECT
    string_agg(
        part,
        ' ; '
        order by ord
    )
FROM
    info;
#v-

This is the query I used while adding tables: add 40 tables to the publication, then wait until the counts for all
states other than "r" drop to 0.
Afterwards I used:

#v+
SELECT
    n.nspname,
    c.relname
FROM
    pg_subscription s
    join pg_subscription_rel sr on s.oid = sr.srsubid
    join pg_class c on sr.srrelid = c.oid
    join pg_namespace n on c.relnamespace = n.oid
WHERE
    s.subname = 'focal14'
ORDER BY n.nspname, c.relname
#v-

and compared it against

#v+
SELECT
    schemaname,
    tablename
FROM
    pg_publication_tables
WHERE
    pubname = 'focal14'
ORDER BY schemaname, tablename
#v-

> Could you share the error message you got? I assume you saw something like
> "Key (XXX)=(YYY) is not present in table "ZZZ".", but please paste it here.

#v+
ALTER TABLE ONLY c33s16831.ag_change_records_2023_10 ADD CONSTRAINT fk_rails_0113b4b993 FOREIGN KEY (sub_id) REFERENCES
c33s16831.subs(id);
ERROR:  insert or update on table "ag_change_records_2023_10" violates foreign key constraint "fk_rails_0113b4b993"
DETAIL:  Key (sub_id)=(19733141) is not present in table "subs".
#v-

> If so, are tuples really missing at that time?

Yes. I checked, and the whole block of values was missing - as I showed in the previous mail.

> FYI - Below workflow is a brief description for initial synchronization.
> 1. Tablesync worker connects to the publisher
> 2. Starts a transaction on both pub/sub
> 3. Creates a replication slot on the publisher
> 4. Executes "COPY ... TO STDOUT" on the publisher. This means that all tuples are extracted
> 5. Does BeginCopyFrom() and CopyFrom(); these use the same infrastructure as a normal "COPY FROM" statement.
> 6. Commits the transaction on both pub/sub
> 7. Catches up on changes made since the replication slot was created in step 3.

I kinda understand it. That's why I thought it was important to report that
sometimes, something, somewhere breaks.

Since my project is time sensitive, I have to proceed, and if such problem
happens on my boxes, I just redo the whole sync from scratch.

So I don't mind that there is a bug, as I have a workaround, but I think it would
be cool to figure out what the potential source could be, and perhaps find a
fix.

> > Unfortunately I don't have a way to repeat the problem. Just noticed
> > that on some dbs that we're upgrading the problem is more common, and on
> > some - it generally doesn't happen.
> Just to confirm - You met the error only when the logical replication was set
> up from old clusters to newer, right?

Yes. That's what we use LR for. For upgrades.

It happens randomly, on some db clusters more than on others, but all in all,
at most I had to rebuild an upgrade box 3 times, and then the problem was gone.
Usually (in > 90% of cases) there is no problem.

Best regards,

depesz




Re: Logical replication is missing block of rows when sending initial sync?

From: Kyotaro Horiguchi
Date: Thu, 02 Nov 2023 10:17:13 +0900
At Mon, 30 Oct 2023 07:10:35 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in 
> I've tried, but I could not reproduce the failure. PSA the script what I did.

I'm not well-versed in the details of logical replication, but does
logical replication inherently operate in such a way that it fully
maintains relationships between tables? If not, isn't it possible that
the issue in question is not about missing referenced data, but merely
a temporary delay?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Logical replication is missing block of rows when sending initial sync?

From: "Hayato Kuroda (Fujitsu)"
Date: Thu, 2 Nov 2023 03:17:03 +0000
Dear Horiguchi-san,

> I'm not well-versed in the details of logical replication, but does
> logical replication inherently operate in such a way that it fully
> maintains relationships between tables? If not, isn't it possible that
> the issue in question is not about missing referenced data, but merely
> a temporary delay?

I assumed that the tuples referred to by the FK existed before or during the synchronization.
In that case, IIUC, the state can never be 'r' before they are replicated. Thoughts?
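
For reference, the per-table sync state can be checked on the subscriber with something like this (a sketch, using the table name from the first mail):

#v+
SELECT srsubstate, srsublsn
FROM pg_subscription_rel
WHERE srrelid = 'c33s16831.subs'::regclass;
#v-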

Dear Depesz,

Is the above assumption right?

Also, I suggest checking pg_stat_replication on the publisher when you
meet the failure again. If pg_stat_replication.replay_lsn is much behind
pg_stat_replication.sent_lsn, the replication may be delayed, so the tuples
may simply not have appeared yet. pg_stat_subscription on the subscriber has similar information.

Note that pg_stat_replication may not be updated if feedback from the subscriber
is disabled.

https://www.postgresql.org/docs/14/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION
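
A minimal sketch of that check on the publisher (pg_wal_lsn_diff() gives the gap in bytes):

#v+
SELECT application_name, sent_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;
#v-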

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
Date: Thu, 2 Nov 2023
On Thu, Nov 02, 2023 at 10:17:13AM +0900, Kyotaro Horiguchi wrote:
> At Mon, 30 Oct 2023 07:10:35 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in 
> > I've tried, but I could not reproduce the failure. PSA the script what I did.
> 
> I'm not well-versed in the details of logical replication, but does
> logical replication inherently operate in such a way that it fully
> maintains relationships between tables? If not, isn't it possible that
> the issue in question is not about missing referenced data, but merely
> a temporary delay?

The problem is that data that appeared *later* was visible on the
subscriber. Data that came earlier was visible too. Just some block of
data got, for some reason, skipped.

Best regards,

depesz




Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
On Thu, Nov 02, 2023 at 03:17:03AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Dear Horiguchi-san, 
> 
> > I'm not well-versed in the details of logical replication, but does
> > logical replication inherently operate in such a way that it fully
> > maintains relationships between tables? If not, isn't it possible that
> > the issue in question is not about missing referenced data, but merely
> > a temporary delay?
> 
> I assumed that tuples referred by FK existed before synchronization or during that.
> In this case, IIUC the status is never 'r' before they are replicated. Thought?
> 
> Dear Depesz,
> 
> Is above assumption right? 
> 
> Also, I suggest you that pg_stat_replicaiton can be checked on publisher when you
> meet the failure again. If pg_stat_replication.replay_lsn is much behind
> pg_stat_replication.sent_lsn, the replication may be delayed so that the tuples
> are not appeared. pg_stat_subscription on subscriber has similar infos with replay_lsn.
> 
> Note that pg_stat_replication may be not updated if feedbacks from a subscriber
> is disabled.
> 
> https://www.postgresql.org/docs/14/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

Whenever the problem happens, lag is virtually nonexistent.
We test it by adding new rows to a couple of tables and verifying that these
rows show up on the subscriber. They did.

Again - replication was *working*, with the exception of a block of rows
missing from one table (so far it has never happened on more than one).

Best regards,

depesz




Re: Logical replication is missing block of rows when sending initial sync?

From: Amit Kapila
Date: Fri, 3 Nov 2023 09:09:12 +0530
On Thu, Nov 2, 2023 at 4:53 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Thu, Nov 02, 2023 at 10:17:13AM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 30 Oct 2023 07:10:35 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in
> > > I've tried, but I could not reproduce the failure. PSA the script what I did.
> >
> > I'm not well-versed in the details of logical replication, but does
> > logical replication inherently operate in such a way that it fully
> > maintains relationships between tables? If not, isn't it possible that
> > the issue in question is not about missing referenced data, but merely
> > a temporary delay?
>
> The problem is that data that appeared *later* was visible on the
> subscriber. Data that came earlier was visible too. Just some block of
> data got, for some reason, skipped.
>

Quite strange. I think to narrow down such a problem, the first thing
to figure out is whether the data is skipped by initial sync or later
replication. To find that out, you can check remote_lsn value in
pg_replication_origin_status for the origin used in the initial sync
once the relation reaches the 'ready' state. Then, you can try to see
on the publisher side using pg_waldump whether the missing rows exist
before the value of remote_lsn or after it. That can help us to narrow
down the problem and could give us some clues for the next steps.
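
Concretely, that check on the subscriber is just something like (a sketch; the external_id column identifies the origin, e.g. pg_<subscription oid> for the apply worker):

#v+
SELECT external_id, remote_lsn, local_lsn
FROM pg_replication_origin_status;
#v-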

--
With Regards,
Amit Kapila.



Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
Date: Fri, 3 Nov 2023
On Fri, Nov 03, 2023 at 09:09:12AM +0530, Amit Kapila wrote:
> On Thu, Nov 2, 2023 at 4:53 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > On Thu, Nov 02, 2023 at 10:17:13AM +0900, Kyotaro Horiguchi wrote:
> > > At Mon, 30 Oct 2023 07:10:35 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in
> > > > I've tried, but I could not reproduce the failure. PSA the script what I did.
> > >
> > > I'm not well-versed in the details of logical replication, but does
> > > logical replication inherently operate in such a way that it fully
> > > maintains relationships between tables? If not, isn't it possible that
> > > the issue in question is not about missing referenced data, but merely
> > > a temporary delay?
> >
> > The problem is that data that appeared *later* was visible on the
> > subscriber. Data that came earlier was visible too. Just some block of
> > data got, for some reason, skipped.
> >
> 
> Quite strange. I think to narrow down such a problem, the first thing
> to figure out is whether the data is skipped by initial sync or later
> replication. To find that out, you can check remote_lsn value in
> pg_replication_origin_status for the origin used in the initial sync
> once the relation reaches the 'ready' state. Then, you can try to see
> on the publisher side using pg_waldump whether the missing rows exist
> before the value of remote_lsn or after it. That can help us to narrow
> down the problem and could give us some clues for the next steps.

I will be prepping another set of clusters to upgrade soon, will try to
get some more data. The window to work on the bad data isn't long,
though.

Best regards,

depesz




Re: Logical replication is missing block of rows when sending initial sync?

From: Tomas Vondra
Date: Fri, 3 Nov 2023 13:20:22 +0100

On 11/3/23 13:04, hubert depesz lubaczewski wrote:
> On Fri, Nov 03, 2023 at 09:09:12AM +0530, Amit Kapila wrote:
>> On Thu, Nov 2, 2023 at 4:53 PM hubert depesz lubaczewski
>> <depesz@depesz.com> wrote:
>>>
>>> On Thu, Nov 02, 2023 at 10:17:13AM +0900, Kyotaro Horiguchi wrote:
>>>> At Mon, 30 Oct 2023 07:10:35 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in
>>>>> I've tried, but I could not reproduce the failure. PSA the script what I did.
>>>>
>>>> I'm not well-versed in the details of logical replication, but does
>>>> logical replication inherently operate in such a way that it fully
>>>> maintains relationships between tables? If not, isn't it possible that
>>>> the issue in question is not about missing referenced data, but merely
>>>> a temporary delay?
>>>
>>> The problem is that data that appeared *later* was visible on the
>>> subscriber. Data that came earlier was visible too. Just some block of
>>> data got, for some reason, skipped.
>>>
>>
>> Quite strange. I think to narrow down such a problem, the first thing
>> to figure out is whether the data is skipped by initial sync or later
>> replication. To find that out, you can check remote_lsn value in
>> pg_replication_origin_status for the origin used in the initial sync
>> once the relation reaches the 'ready' state. Then, you can try to see
>> on the publisher side using pg_waldump whether the missing rows exist
>> before the value of remote_lsn or after it. That can help us to narrow
>> down the problem and could give us some clues for the next steps.
> 
> I will be prepping another set of clusters to upgrade soon, will try to
> get some more data. The window to work on the bad data isn't long,
> though.
> 

I think it'd be interesting to know:

1) Commit LSN for the missing rows (for the xmin).

2) Did the other changes from these transactions get replicated
correctly?

3) LSNs used for the tablesync slot, catchup etc. I believe those are in
the server log.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
Date: Fri, 3 Nov 2023 13:22:42 +0100
On Fri, Nov 03, 2023 at 01:20:22PM +0100, Tomas Vondra wrote:
> I think it'd be interesting to know:
> 
> 1) Commit LSN for the missing rows (for the xmin).
> 
> 2) Did the other changes from these transactions get replicated
> correctly?

How would I know?

> 3) LSNs used for the tablesync slot, catchup etc. I believe those are in
> the server log.

I will provide more logs on the next problem.

Best regards,

depesz




Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
On Fri, Nov 03, 2023 at 01:22:42PM +0100, hubert depesz lubaczewski wrote:
> On Fri, Nov 03, 2023 at 01:20:22PM +0100, Tomas Vondra wrote:
> > I think it'd be interesting to know:
> > 
> > 1) Commit LSN for the missing rows (for the xmin).
> > 
> > 2) Did the other changes from these transactions get replicated
> > correctly?
> 
> How would I know?
> 
> > 3) LSNs used for the tablesync slot, catchup etc. I believe those are in
> > the server log.
> 
> I will provide more logs on the next problem.

OK. found another case (maybe two).

First, the error:

ALTER TABLE ONLY c203s15531.conversation_message_participants ADD CONSTRAINT fk_rails_992a8b9e13 FOREIGN KEY (conversation_message_id) REFERENCES c203s15531.conversation_messages(id);
ERROR:  insert or update on table "conversation_message_participants" violates foreign key constraint "fk_rails_992a8b9e13"
DETAIL:  Key (conversation_message_id)=(431568) is not present in table "conversation_messages".

With forced seq scans (set enable_indexscan = false; set enable_bitmapscan = false; set enable_indexonlyscan = false;)

I got:

(12/bionic):

#v+
=# select count(*) from c203s15531.conversation_messages where id = 431568;
 count
───────
     1
(1 row)

=# select count(*) from c203s15531.conversation_message_participants where conversation_message_id=431568;
 count
───────
     2
(1 row)
#v-

(14/focal):

#v+
=# select count(*) from c203s15531.conversation_messages where id = 431568;
 count
───────
     0
(1 row)

=# select count(*) from c203s15531.conversation_message_participants where conversation_message_id=431568;
 count
───────
     2
(1 row)
#v-

So it looks like some rows are missing. Let's see the missing range:

#v+
Pg14# select max(id) from conversation_messages where id < 431568;
  max
────────
 431567
(1 row)

Pg14# select min(id) from conversation_messages where id > 431568;
  min
────────
 431572
(1 row)
#v-
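
(As an aside, a window-function query can enumerate all such gaps at once instead of probing around a known id - a sketch, keeping in mind that gaps can also arise legitimately, e.g. from rolled-back inserts consuming sequence values:)

#v+
SELECT id + 1 AS gap_start, next_id - 1 AS gap_end
FROM (
    SELECT id, lead(id) OVER (ORDER BY id) AS next_id
    FROM conversation_messages
) s
WHERE next_id > id + 1;
#v-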

So, the block from 431568 to 431571 is missing. On pg12, though:

#v+
Pg12# select count(*) from conversation_messages where id between 431568 and 431571;
 count
───────
     4
(1 row)
Pg12# select ctid, xmin, cmin, xmax, cmax, id, created_at from conversation_messages where id between 431568 and 431571;
    ctid    │    xmin    │ cmin │    xmax    │ cmax │   id   │         created_at
────────────┼────────────┼──────┼────────────┼──────┼────────┼────────────────────────────
 (29604,10) │ 3095841913 │    1 │ 3095841995 │    1 │ 431568 │ 2023-11-07 15:12:38.712504
 (29604,11) │ 3095848690 │    2 │ 3095848745 │    2 │ 431569 │ 2023-11-07 15:12:49.5695
 (29604,12) │ 3095884949 │    2 │ 3095885033 │    2 │ 431570 │ 2023-11-07 15:13:37.511025
 (29604,13) │ 3095917833 │    2 │ 3095917890 │    2 │ 431571 │ 2023-11-07 15:14:40.058367
(4 rows)
#v-

In the logs on bionic/pg12 I found when this particular table was added to the publication:

#v+
2023-11-07 15:10:19.023 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,29,"idle",2023-11-07 15:10:18 UTC,30/865502097,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15745.a_g_c_r_2023_10;",,,,,,,,,"psql"
2023-11-07 15:10:19.026 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,30,"idle",2023-11-07 15:10:18 UTC,30/865502098,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15311.q_q;",,,,,,,,,"psql"
2023-11-07 15:10:19.032 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,31,"idle",2023-11-07 15:10:18 UTC,30/865502099,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s16352.m_2023_06;",,,,,,,,,"psql"
2023-11-07 15:10:19.035 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,32,"idle",2023-11-07 15:10:18 UTC,30/865502100,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15531.conversation_messages;",,,,,,,,,"psql"
2023-11-07 15:10:19.039 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,33,"idle",2023-11-07 15:10:18 UTC,30/865502101,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15479.q_s_e_2023_10;",,,,,,,,,"psql"
2023-11-07 15:10:19.043 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,34,"idle",2023-11-07 15:10:18 UTC,30/865502102,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s17062.wiki_pages;",,,,,,,,,"psql"
2023-11-07 15:10:19.046 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,35,"idle",2023-11-07 15:10:18 UTC,30/865502103,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15841.a_g_c_r_2023_10;",,,,,,,,,"psql"
#v-

Since it was previously asked - schema:

#v+
        Column         │            Type             │ Collation │ Nullable │                      Default
───────────────────────┼─────────────────────────────┼───────────┼──────────┼───────────────────────────────────────────────────
 id                    │ bigint                      │           │ not null │ nextval('conversation_messages_id_seq'::regclass)
 ..................... │ bigint                      │           │          │
 ..................... │ bigint                      │           │          │
 created_at            │ timestamp without time zone │           │          │
 ..................... │ boolean                     │           │          │
 ..................... │ text                        │           │          │
 ..................... │ text                        │           │          │
 ..................... │ character varying(255)      │           │          │
 ..................... │ character varying(255)      │           │          │
 ..................... │ bigint                      │           │          │
 ..................... │ character varying(255)      │           │          │
 ..................... │ bigint                      │           │          │
 ..................... │ character varying(255)      │           │          │
 ..................... │ text                        │           │          │
 ..................... │ boolean                     │           │          │
 ..................... │ boolean                     │           │          │
 ..................... │ text                        │           │          │
Indexes:
    "conversation_messages_pkey" PRIMARY KEY, btree (id), tablespace "data1"
    "index_conversation_messages_on_...." btree (.........), tablespace "data1"
    "index_conversation_messages_on_......_and_created_at" btree (..........., created_at), tablespace "data1"
Foreign-key constraints:
    "fk_rails_c3c322d882" FOREIGN KEY (conversation_id) REFERENCES conversations(id)
Referenced by:
    TABLE "conversation_message_participants" CONSTRAINT "fk_rails_992a8b9e13" FOREIGN KEY (conversation_message_id) REFERENCES conversation_messages(id)
    TABLE "conversation_batches" CONSTRAINT "fk_rails_d421fc74f4" FOREIGN KEY (root_conversation_message_id) REFERENCES conversation_messages(id)
Publications:
    "cdc"
    "focal14"
Triggers:
    guard_excessive_deletes AFTER DELETE ON conversation_messages REFERENCING OLD TABLE AS oldtbl FOR EACH STATEMENT EXECUTE FUNCTION guard_excessive_updates()
    guard_excessive_updates AFTER UPDATE ON conversation_messages REFERENCING OLD TABLE AS oldtbl FOR EACH STATEMENT EXECUTE FUNCTION guard_excessive_updates()
#v-

There are no dropped attributes.

Stats:

#v+
Pg12# select * from pg_stat_user_tables  where relid = 'c203s15531.conversation_messages'::regclass \gx
─[ RECORD 1 ]───────┬──────────────────────────────
relid               │ 166420
schemaname          │ c203s15531
relname             │ conversation_messages
seq_scan            │ 14
seq_tup_read        │ 2595888
idx_scan            │ 16059568
idx_tup_fetch       │ 27426090
n_tup_ins           │ 372696
n_tup_upd           │ 38
n_tup_del           │ 0
n_tup_hot_upd       │ 0
n_live_tup          │ 432630
n_dead_tup          │ 0
n_mod_since_analyze │ 7034
last_vacuum         │ [null]
last_autovacuum     │ 2023-08-10 18:06:11.655573+00
last_analyze        │ [null]
last_autoanalyze    │ 2023-11-01 18:56:52.868395+00
vacuum_count        │ 0
autovacuum_count    │ 5
analyze_count       │ 0
autoanalyze_count   │ 91
#v-

I can leave it as is for, let's say, a day. So if there is any query/data
you'd like me to gather, let me know, but afterwards I will have to rebuild
the focal side from scratch.

For whatever it's worth, we're adding tables to the publication using:

#v+
with base as (
    select n.nspname, c.relname, c.relpages
    from pg_class c join pg_namespace n on c.relnamespace = n.oid
        left outer join pg_catalog.pg_publication_tables pt on pt.pubname = 'focal14' and pt.schemaname = n.nspname and pt.tablename = c.relname
    where c.relkind in ('p', 'r') and n.nspname ~ 'schemas|we|care|about' and pt.pubname is null
)
select format('ALTER PUBLICATION focal14 ADD TABLE ONLY %I.%I;', nspname, relname)
from base
order by relpages desc
limit 40 \gexec
#v-

Best regards,

depesz



Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
Date: Wed, 8 Nov 2023
> OK. found another case (maybe two).

OK, the other case is real. Different database. Same scenario.

Errors (3 because we have 3 missing fkeys):

#v+
ALTER TABLE ONLY c44s16539.assignment_overrides ADD CONSTRAINT fk_rails_58f8ee369b FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
ERROR:  insert or update on table "assignment_overrides" violates foreign key constraint "fk_rails_58f8ee369b"
DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".

ALTER TABLE ONLY c44s16539.post_policies ADD CONSTRAINT fk_rails_cf2d119863 FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
ERROR:  insert or update on table "post_policies" violates foreign key constraint "fk_rails_cf2d119863"
DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".

ALTER TABLE ONLY c44s16539.submissions ADD CONSTRAINT fk_rails_61cac0823d FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
ERROR:  insert or update on table "submissions" violates foreign key constraint "fk_rails_61cac0823d"
DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".
#v-

In all cases the problem is a missing c44s16539.assignments row with id = 248083.

This time it looks like only one row is missing:

#v+
Pg14# select max(id) from c44s16539.assignments where id < 248083;
  max
────────
 248082
(1 row)

Pg14# select min(id) from c44s16539.assignments where id > 248083;
  min
────────
 248084
(1 row)
#v-

Schema is complicated. 76 columns. On the pg12 side one column is dropped.
22 indexes, 9 fkeys from this table (assignments) to others, ~ 35 fkeys
pointing to this table (but only 3 have rows that reference the
missing record).

Some information about the record:

#v+
Pg12# select ctid, xmin, cmin, xmax, cmax, id, created_at, updated_at from c44s16539.assignments where id = 248083 \gx
─[ RECORD 1 ]──────────────────────────
ctid       │ (24192,1)
xmin       │ 472291499
cmin       │ 12
xmax       │ 472291499
cmax       │ 12
id         │ 248083
created_at │ 2023-11-07 15:50:05.506077
updated_at │ 2023-11-07 15:51:28.738893
#v-

Logs from when it was added to the publication:

#v+
2023-11-07 15:47:27.898 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,39,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.476 ms",,,,,,,,,"psql"
2023-11-07 15:47:27.899 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,40,"idle",2023-11-07 15:47:27 UTC,6/1209097693,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c247s1094.messages_2022_50;",,,,,,,,,"psql"
2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,41,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.493 ms",,,,,,,,,"psql"
2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,42,"idle",2023-11-07 15:47:27 UTC,6/1209097694,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c44s16539.assignments;",,,,,,,,,"psql"
2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,43,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.936 ms",,,,,,,,,"psql"
2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,44,"idle",2023-11-07 15:47:27 UTC,6/1209097695,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c39s5131.messages_2023_20;",,,,,,,,,"psql"
2023-11-07 15:47:27.903 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,45,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.292 ms",,,,,,,,,"psql"

It is interesting that the time the row was inserted (created_at) is
always "around" the time the table was added to the publication...

Just like with the other case, I can leave it in its current state for
a day, so if you need more information from it, let me know.

Best regards,

depesz




Re: Logical replication is missing block of rows when sending initial sync?

From: Tomas Vondra
Date: Thu, 9 Nov 2023 20:14:46 +0100

On 11/8/23 15:02, hubert depesz lubaczewski wrote:
>> OK. found another case (maybe two).
> 
> OK, the other case is real. Different database. Same scenario.
> 
> Errors (3 because we have 3 missing fkeys):
> 
> #v+
> ALTER TABLE ONLY c44s16539.assignment_overrides ADD CONSTRAINT fk_rails_58f8ee369b FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
> ERROR:  insert or update on table "assignment_overrides" violates foreign key constraint "fk_rails_58f8ee369b"
> DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".
> 
> ALTER TABLE ONLY c44s16539.post_policies ADD CONSTRAINT fk_rails_cf2d119863 FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
> ERROR:  insert or update on table "post_policies" violates foreign key constraint "fk_rails_cf2d119863"
> DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".
> 
> ALTER TABLE ONLY c44s16539.submissions ADD CONSTRAINT fk_rails_61cac0823d FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
> ERROR:  insert or update on table "submissions" violates foreign key constraint "fk_rails_61cac0823d"
> DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".
> #v-
> 
> In all cases the problem is a missing c44s16539.assignments row with id = 248083.
> 
> This time it looks like only one row is missing:
> 
> #v+
> Pg14# select max(id) from c44s16539.assignments where id < 248083;
>   max
> ────────
>  248082
> (1 row)
> 
> Pg14# select min(id) from c44s16539.assignments where id > 248083;
>   min
> ────────
>  248084
> (1 row)
> #v-
> 
> Schema is complicated. 76 columns. On the pg12 side one column is dropped.
> 22 indexes, 9 fkeys from this table (assignments) to others, ~ 35 fkeys
> pointing to this table (but only 3 have rows that reference the
> missing record).
> 
> Some information about the record:
> 
> #v+
> Pg12# select ctid, xmin, cmin, xmax, cmax, id, created_at, updated_at from c44s16539.assignments where id = 248083 \gx
> ─[ RECORD 1 ]──────────────────────────
> ctid       │ (24192,1)
> xmin       │ 472291499
> cmin       │ 12
> xmax       │ 472291499
> cmax       │ 12
> id         │ 248083
> created_at │ 2023-11-07 15:50:05.506077
> updated_at │ 2023-11-07 15:51:28.738893
> #v-
> 
> Logs from when it was added to the publication:
> 
> #v+
> 2023-11-07 15:47:27.898 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,39,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.476 ms",,,,,,,,,"psql"
> 2023-11-07 15:47:27.899 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,40,"idle",2023-11-07 15:47:27 UTC,6/1209097693,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c247s1094.messages_2022_50;",,,,,,,,,"psql"
> 2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,41,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.493 ms",,,,,,,,,"psql"
> 2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,42,"idle",2023-11-07 15:47:27 UTC,6/1209097694,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c44s16539.assignments;",,,,,,,,,"psql"
> 2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,43,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.936 ms",,,,,,,,,"psql"
> 2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,44,"idle",2023-11-07 15:47:27 UTC,6/1209097695,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c39s5131.messages_2023_20;",,,,,,,,,"psql"
> 2023-11-07 15:47:27.903 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,45,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.292 ms",,,,,,,,,"psql"
> #v-
> 
> It is interesting that the time the row was inserted (created_at) is
> always "around" the time the table was added to the publication...
> 
> Just like with the other case, I can leave it in current state for
> a day, so if you'd need more information from it, let me know.
> 

I think it'd be interesting to look at the WAL using pg_waldump, and see
how it relates to the LSN used for the tablesync slot.

I'll try reproducing this locally over the weekend. Should I use the
test_1030.sh script that you shared a week ago, or do I need to do
something more?

One thing that puzzles me is that IIUC the script doesn't generate any
changes while adding the tables to the subscription? I probably should
have paid more attention to the details, instead of assuming there are
concurrent changes - which seemed like there might be some race, or
something. But this seems like there are no changes and the tablesync
just fails to copy some subset of it. That's *really* strange.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
Date: Fri, 10 Nov 2023 12:44:17 +0100
On Thu, Nov 09, 2023 at 08:14:46PM +0100, Tomas Vondra wrote:
> 
> 
> On 11/8/23 15:02, hubert depesz lubaczewski wrote:
> >> OK. found another case (maybe two).
> > 
> > OK, the other case is real. Different database. Same scenario.
> > 
> > Errors (3 because we have 3 missing fkeys):
> > 
> > #v+
> > ALTER TABLE ONLY c44s16539.assignment_overrides ADD CONSTRAINT fk_rails_58f8ee369b FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
> > ERROR:  insert or update on table "assignment_overrides" violates foreign key constraint "fk_rails_58f8ee369b"
> > DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".
> > 
> > ALTER TABLE ONLY c44s16539.post_policies ADD CONSTRAINT fk_rails_cf2d119863 FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
> > ERROR:  insert or update on table "post_policies" violates foreign key constraint "fk_rails_cf2d119863"
> > DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".
> > 
> > ALTER TABLE ONLY c44s16539.submissions ADD CONSTRAINT fk_rails_61cac0823d FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
> > ERROR:  insert or update on table "submissions" violates foreign key constraint "fk_rails_61cac0823d"
> > DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".
> > #v-
> > 
> > In all cases the problem is a missing c44s16539.assignments row with id = 248083.
> > 
> > This time it looks like only one row is missing:
> > 
> > #v+
> > Pg14# select max(id) from c44s16539.assignments where id < 248083;
> >   max
> > ────────
> >  248082
> > (1 row)
> > 
> > Pg14# select min(id) from c44s16539.assignments where id > 248083;
> >   min
> > ────────
> >  248084
> > (1 row)
> > #v-
> > 
> > Schema is complicated. 76 columns. On the pg12 side one column is dropped.
> > 22 indexes, 9 fkeys from this table (assignments) to others, ~ 35 fkeys
> > pointing to this table (but only 3 have rows that reference the
> > missing record).
> > 
> > Some information about the record:
> > 
> > #v+
> > Pg12# select ctid, xmin, cmin, xmax, cmax, id, created_at, updated_at from c44s16539.assignments where id = 248083 \gx
> > ─[ RECORD 1 ]──────────────────────────
> > ctid       │ (24192,1)
> > xmin       │ 472291499
> > cmin       │ 12
> > xmax       │ 472291499
> > cmax       │ 12
> > id         │ 248083
> > created_at │ 2023-11-07 15:50:05.506077
> > updated_at │ 2023-11-07 15:51:28.738893
> > #v-
> > 
> > Logs from when it was added to the publication:
> > 
> > #v+
> > 2023-11-07 15:47:27.898 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,39,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.476 ms",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.899 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,40,"idle",2023-11-07 15:47:27 UTC,6/1209097693,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c247s1094.messages_2022_50;",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,41,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.493 ms",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,42,"idle",2023-11-07 15:47:27 UTC,6/1209097694,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c44s16539.assignments;",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,43,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.936 ms",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,44,"idle",2023-11-07 15:47:27 UTC,6/1209097695,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c39s5131.messages_2023_20;",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.903 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,45,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.292 ms",,,,,,,,,"psql"
> > #v-
> > 
> > It is interesting that the time the row was inserted (created_at) is
> > always "around" the time the table was added to the publication...
> > 
> > Just like with the other case, I can leave it in current state for
> > a day, so if you'd need more information from it, let me know.
> > 
> 
> I think it'd be interesting to look at the WAL using pg_waldump, and see
> how it relates to the LSN used for the tablesync slot.

well, I can try to dump and show the data, but it will most likely be
a lot. Are you more interested in looking at the WAL around the second
when the data was inserted, or when the table was added to the publication?

> I'll try reproducing this locally over the weekend. Should I use the
> test_1030.sh script that you shared a week ago, or do I need to do
> something more?

That question is probably not for me, but for Hayato Kuroda.

> One thing that puzzles me is that IIUC the script doesn't generate any
> changes while adding the tables to the subscription? I probably should
> have paid more attention to the details, instead of assuming there are
> concurrent changes - which seemed like there might be some race, or
> something. But this seems like there are no changes and the tablesync
> just fails to copy some subset of it. That's *really* strange.

This script was made by Hayato Kuroda, as far as I can tell. Our
situation is not like this.

In our case there is constant traffic on the publisher from the normally
working app.

Best regards,

depesz




Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
On Fri, Nov 10, 2023 at 12:44:17PM +0100, hubert depesz lubaczewski wrote:
> > I think it'd be interesting to look at the WAL using pg_waldump, and see
> > how it relates to the LSN used for the tablesync slot.
> 
> well, I can try to dump, and show data, but it will most likely be
> a lot. Are you more interested in looking into wal around the second
> where data was inserted, or when table was added to publication?

I tried to find it based on the relfilenode of the assignments table, and the
ctid, but I can't find it in the waldump output :(

I'm probably doing something wrong, or working from wrong assumptions, but
I got 33GB of WAL, dumped it all, and grepped for the relfilenode (41441)
and the block number (ctid is (24192,1), so I assumed block 24191, 24192,
or 24193) - nothing like this exists in the WAL dump.
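
One sanity check worth doing with that approach (a sketch): the relfilenode is not necessarily equal to the table's OID, and it changes whenever the table is rewritten (VACUUM FULL, TRUNCATE, CLUSTER), so it's safest to read it directly:

#v+
SELECT pg_relation_filenode('c44s16539.assignments'::regclass) AS relfilenode,
       pg_relation_filepath('c44s16539.assignments'::regclass) AS path;
#v-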

Best regards,

depesz




RE: Logical replication is missing block of rows when sending initial sync?

From: "Hayato Kuroda (Fujitsu)"
Date: Wed, 15 Nov 2023
Dear Tomas, Depesz,

> 
> > I'll try reproducing this locally over the weekend. Should I use the
> > test_1030.sh script that you shared a week ago, or do I need to do
> > something more?
> 
> That question is probably not to me, but to Hayato Kuroda

Just FYI - as I posted, this script could not reproduce the issue. We may need to use the same table definition,
but I have not tried that yet. I will let you know if I succeed in reproducing it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Logical replication is missing block of rows when sending initial sync?

From: Tomas Vondra
Date: Fri, 17 Nov 2023 15:40:35 +0100
On 11/15/23 02:41, Hayato Kuroda (Fujitsu) wrote:
> Dear Tomas, Depesz,
> 
>>
>>> I'll try reproducing this locally over the weekend. Should I use the
>>> test_1030.sh script that you shared a week ago, or do I need to do
>>> something more?
>>
>> That question is probably not to me, but to Hayato Kuroda
> 
> Just FYI - as I posted, this script could not reproduce the issue. We may need to use the same table definition,
> but I have not tried that yet. I will let you know if I succeed in reproducing it.
> 

FYI I think I've reproduced the issue (with a different script), I've
started a thread on pgsql-hackers:

https://www.postgresql.org/message-id/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Obviously, I can't be sure it's exactly the same issue, but the symptoms
seem to match what Depesz reported (gaps in data etc.). I don't know
what the root cause is, or the fix. But there's more info (and scripts) in
the hackers thread.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski
On Fri, Nov 17, 2023 at 03:40:35PM +0100, Tomas Vondra wrote:
> On 11/15/23 02:41, Hayato Kuroda (Fujitsu) wrote:
> > Dear Tomas, Depesz,
> > 
> >>
> >>> I'll try reproducing this locally over the weekend. Should I use the
> >>> test_1030.sh script that you shared a week ago, or do I need to do
> >>> something more?
> >>
> >> That question is probably not to me, but to Hayato Kuroda
> > 
> > Just FYI - as I posted, this script could not reproduce the issue. We may need to use the same table definition,
> > but I have not tried that yet. I will let you know if I succeed in reproducing it.
> > 
> 
> FYI I think I've reproduced the issue (with a different script), I've
> started a thread on pgsql-hackers:
> 
> https://www.postgresql.org/message-id/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
> 
> Obviously, I can't be sure it's exactly the same issue, but the symptoms
> seem to match what Depesz reported (gaps in data etc.). I don't know
> what the root cause is, or fix. But there's more info (and scripts) in
> the hackers thread.

Thanks a lot. Much appreciated. The upgrade will soon be over for us,
but it's good to know that there might be one less bug in pg
thanks to this :)

Best regards,

depesz