Re: Need help debugging slow logical replication - Mailing list pgsql-general

From sunyucong@gmail.com
Subject Re: Need help debugging slow logical replication
Date
Msg-id CAJygYd0=zZ9TWSu-srcJKh8vLPAMh3MEa6LqamhdehqUaTyEcg@mail.gmail.com
Whole thread Raw
In response to Re: Need help debugging slow logical replication  ("sunyucong@gmail.com" <sunyucong@gmail.com>)
List pgsql-general
OK: so I think I've got some new insight by using test_decoding to
peek into the changes being replicate, here is what i think is
happening:

- there is a background job generating a Huge transaction (10K
changes) one after another  that basically does DELETE on 1 table,
INSERT on 1 table and UPDATE on 1 table. Since this is a test
environment, this is the only TX being accumulated.
- since there is only 1 publication,  this  worker on primary
basically spend all its CPU decoding this large transaction then send
to subscriber to execute
   -  memory usage on logical decoding worker maybe a issue: as it
might have spilled to disk during decoding
   -  worker on subscriber basically spend all time applying this change

- I suspect  keep-alives are not sent during applying a transaction,
and the transaction apply time takes longer than wal_sender_timeout
(30seconds) , that's why we have to increase this number before.
- I think utilizing the new streaming feature for logical replication
in PG14 will improve performance here.
- I think the other thing that will help is to drop the index
temporarily on the subscriber to accelerate applying, but will need to
rebuild once close.

BTW: just to clarify Replica identity is not an issue for the table
that only receives INSERT, I tried to create a replica identity for
that table, and it didn't change anything recorded in WAL (as
expected), if anything it actually decreased performance since

What's puzzling for me is that why does logical apply being so slower
than physical apply (and also apparently physical streaming never
times out!)

On Tue, Feb 7, 2023 at 9:13 PM sunyucong@gmail.com <sunyucong@gmail.com> wrote:
>
> ccing the mailist again in case someone else would have a idea how to debug:
>
> Here is what I see in the pg_locks on subscribe at all time:
>
> As you can see, it mostly concern following 3 tables, accounts,
> ledger, pending_ledger_fees, which I have tried analyze, vacuum them
> etc, none of them helped much:
>
> I do see in the log these tables are getting autovaccumed *very*
> frequently, is that a problem for logical replication?
>
> 2023-02-08 04:59:47 UTC::@:[13626]:LOG:  automatic vacuum of table
> "dev.public.accounts": index scans: 0
> pages: 0 removed, 71437 remain, 0 skipped due to pins, 11465 skipped frozen
> tuples: 0 removed, 3590987 remain, 3522495 are dead but not yet
> removable, oldest xmin: 2893907681
> index scan bypassed: 4 pages from table (0.01% of total) have 5 dead
> item identifiers
> I/O timings: read: 0.000 ms, write: 0.000 ms
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
> buffer usage: 120107 hits, 0 misses, 0 dirtied
> WAL usage: 1 records, 0 full page images, 708 bytes
> system usage: CPU: user: 0.23 s, system: 0.00 s, elapsed: 1.01 s
> 2023-02-08 04:59:47 UTC::@:[13626]:LOG:  automatic vacuum of table
> "dev.public.pending_fee_ledger": index scans: 0
> pages: 0 removed, 199818 remain, 0 skipped due to pins, 117769 skipped frozen
> tuples: 0 removed, 3670095 remain, 3520000 are dead but not yet
> removable, oldest xmin: 2893907681
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead
> item identifiers removed
> I/O timings: read: 0.000 ms, write: 0.000 ms
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
> buffer usage: 164212 hits, 0 misses, 0 dirtied
> WAL usage: 0 records, 0 full page images, 0 bytes
> system usage: CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.27 s
>
> Thanks
>
> On Tue, Feb 7, 2023 at 6:10 PM sunyucong@gmail.com <sunyucong@gmail.com> wrote:
> >
> > Hi Justin, thanks for the response!
> >
> > > REPLICA IDENTITY DEFAULT   will only use primary keys,  if the publisher  includes those tables, the subscriber
when replaying the WAL will stop throwing an error not knowing how to replay the UPDATE/DELETE.
 
> >
> > But I don't see any errors being thrown out in the postgresql logs?
> > Should I be seeing it complain there? Is postgresql falling back to
> > replica identity full here?
> >
> > However I checked that table, it doesn't seem to be making progress at
> > all: so I suspect you are right that it is the problem.
> >
> > > Logical Replication is most likely broken at this point.
> > >
> > > I suggest stopping logical replication and correcting tables that don't have qualifying indexes for logical
replicationby creating the necessary indexes and avoid using replica identity full.  Then restart logical replication
fromthe beginning.
 
> > >
> > >



pgsql-general by date:

Previous
From: Andrus
Date:
Subject: How to create directory format backup
Next
From: hubert depesz lubaczewski
Date:
Subject: Re: How to create directory format backup