Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables - Mailing list pgsql-hackers

From: Amit Kapila
Subject: Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
Msg-id: CAA4eK1Lb3sY8TEfQrtZ8ceeHy3=Z-H=dsYcbjWnYonD=e8EvHA@mail.gmail.com
In response to: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables (Keisuke Kuroda <keisuke.kuroda.3862@gmail.com>)
Responses: Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables (Keisuke Kuroda <keisuke.kuroda.3862@gmail.com>)
List: pgsql-hackers
On Wed, Sep 23, 2020 at 1:09 PM Keisuke Kuroda
<keisuke.kuroda.3862@gmail.com> wrote:
>
> Hi hackers,
>
> I found a problem in logical replication.
> It seems to have the same cause as the following problem.
>
>   Creating many tables gets logical replication stuck
>   https://www.postgresql.org/message-id/flat/20f3de7675f83176253f607b5e199b228406c21c.camel%40cybertec.at
>
>   Logical decoding CPU-bound w/ large number of tables
>   https://www.postgresql.org/message-id/flat/CAHoiPjzea6N0zuCi%3D%2Bf9v_j94nfsy6y8SU7-%3Dbp4%3D7qw6_i%3DRg%40mail.gmail.com
>
> # problem
>
> * logical replication is enabled
> * the walsender process has a RelfilenodeMap cache (2000 relations in this case)
> * TRUNCATE, DROP, or CREATE many tables in the same transaction
>
> While this happens, the walsender process keeps using 100% of one CPU core.
>
...
...
>
> ./src/backend/replication/logical/reorderbuffer.c (around line 1746):
>
>     case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
>       Assert(change->data.command_id != InvalidCommandId);
>
>       if (command_id < change->data.command_id)
>       {
>         command_id = change->data.command_id;
>
>         if (!snapshot_now->copied)
>         {
>           /* we don't use the global one anymore */
>           snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
>                              txn, command_id);
>         }
>
>         snapshot_now->curcid = command_id;
>
>         TeardownHistoricSnapshot(false);
>         SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
>
>         /*
>          * Every time the CommandId is incremented, we could
>          * see new catalog contents, so execute all
>          * invalidations.
>          */
>         ReorderBufferExecuteInvalidations(rb, txn);
>       }
>
>       break;
>
> Do you have any solutions?
>

Yeah, I have an idea of how to solve this problem. The problem is
primarily due to the fact that we used to receive invalidations only
at commit time and then had to execute all of them after each command
id change. However, after commit c55040ccd0 (When wal_level=logical,
write invalidations at command end into WAL so that decoding can use
this information.) we actually know exactly when each invalidation
needs to be executed. The idea is that instead of collecting
invalidations only in ReorderBufferTxn, we also collect them in the
form of a ReorderBufferChange, similar to what we do for other
changes (for example, REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID). We
still need to collect them in ReorderBufferTxn as well, because if
the transaction is aborted or some exception occurs while executing
the changes, we need to perform all the invalidations.
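
To make the cost difference concrete, here is a rough stand-alone
model (plain C, not PostgreSQL code; Change, replay_per_commit, and
replay_per_change are made-up names) contrasting the current
behaviour, where every command id change replays the transaction's
complete invalidation list, with the proposed one, where each change
carries and replays only its own invalidations:

#include <stdio.h>
#include <stdlib.h>

/*
 * One decoded change; in this toy model it only tracks how many
 * invalidation messages were emitted at its command end.
 */
typedef struct Change
{
    int     command_id;
    int     ninvalidations;
} Change;

/*
 * Current behaviour: the invalidations are known only as one flat list
 * collected for the whole transaction, and every command id change
 * replays that whole list (the ReorderBufferExecuteInvalidations(rb,
 * txn) call in the snippet above).
 */
static long long
replay_per_commit(const Change *changes, int nchanges)
{
    long long   total = 0;
    long long   executed = 0;

    for (int i = 0; i < nchanges; i++)
        total += changes[i].ninvalidations;

    for (int i = 0; i < nchanges; i++)
        executed += total;      /* whole list, every time */

    return executed;
}

/*
 * Proposed behaviour: each change carries the invalidations emitted at
 * its own command end, and replay executes just those.
 */
static long long
replay_per_change(const Change *changes, int nchanges)
{
    long long   executed = 0;

    for (int i = 0; i < nchanges; i++)
        executed += changes[i].ninvalidations;

    return executed;
}

int
main(void)
{
    int         ntables = 2000; /* e.g. TRUNCATE of 2000 tables */
    Change     *changes = malloc(sizeof(Change) * ntables);

    for (int i = 0; i < ntables; i++)
    {
        changes[i].command_id = i;
        changes[i].ninvalidations = 10; /* assumed messages per statement */
    }

    printf("per-commit: %lld executions\n",
           replay_per_commit(changes, ntables));
    printf("per-change: %lld executions\n",
           replay_per_change(changes, ntables));

    free(changes);
    return 0;
}

With 2000 DDL statements emitting (say) 10 messages each, the first
strategy performs 40,000,000 message executions versus 20,000 for the
second. The model only counts executions, not the cache rebuilds each
one triggers, but it shows why the walsender ends up pinned on one CPU
core for such transactions.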

-- 
With Regards,
Amit Kapila.


