temp table on commit delete rows performance issue - Mailing list pgsql-hackers

From Floris Van Nee
Subject temp table on commit delete rows performance issue
Date
Msg-id 365ab92f507e4ec6addef4286782fe71@Optiver.com
Whole thread Raw
Responses Re: temp table on commit delete rows performance issue
Re: temp table on commit delete rows performance issue
List pgsql-hackers
Hi hackers,

I'm looking for some input on an issue I've observed. A common pattern
I've seen is using temporary tables to put data in before updating the
real tables. Something roughly like:

On session start:
CREATE TEMP TABLE temp_t1 (...) ON COMMIT DELETE ROWS;

On update:
BEGIN;
COPY temp_t1 FROM STDIN (FORMAT BINARY);
INSERT INTO t1 (SELECT * FROM temp_t1 ...) ON CONFLICT DO UPDATE SET ...;
-- potentially some other operations on temp table to put data into real table t1
COMMIT;

This pattern starts to break down under certain exceptional circumstances of
high concurrency. The "ON COMMIT DELETE ROWS" does a truncate that is
fairly expensive and doesn't work well in high-concurrency scenarios. It's
especially noticeable under following circumstances:
- high max_connections setting
- high number of temp tables per session
- concurrent writers at fairly short intervals
Impact is on both TPS on primary as well as that the WAL replay process
on replica becomes completely overloaded (100% cpu even though not
a lot of WAL is being generated)

A very simple pgbench example that showcases degradation (taken
with max_connections=2000 to clearly show it).

test_truncate.sql

begin;
set client_min_messages=warning;
create temp table if not exists ut0 (a int, b text, c text) on commit delete rows;
commit;

test_no_truncate.sql

begin;
set client_min_messages=warning;
create temp table if not exists ut0 (a int, b text, c text); -- do not do anything on commit
commit;

pgbench -n -f  test_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 2693.806376 (without initial connection time)

pgbench -n -f test_truncate.sql -T 10 postgres -c10 -j 10
pgbench (18devel)
tps = 12119.358458 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 22685.877281 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c10 -j 10
pgbench (18devel)
tps = 200359.458154 (without initial connection time)

For the test_truncate.sql with 10 threads, WAL replay process is spinning at 100% CPU.

The reason seems to be that for a large part, the 'on commit delete rows'
follows the same code path as a regular truncate.
This means:
* taking an exclusive lock on the temp table (WAL logged)
* invalidating rel cache (WAL logged)

On replica process, these exclusive locks are very expensive to replay,
leading to spending replica most of its cycles trying to get LWLocks on non-fast-path.
On primary, degradation on concurrency I believes comes from the rel cache invalidation.

There doesn't seem to be a good reason for this behavior, as the replica
has nothing to do for a temporary table truncation, so this lock doesn't
need to be WAL logged.
The generated WAL is just fully this stuff. LOCK+INVALIDATION+COMMIT.
And it's spinning at 100% CPU for trying to replay this.

rmgr: Standby     len (rec/tot):     42/    42, tx:    8154875, lsn: 0/3AFFE270, prev 0/3AFFE218, desc: LOCK xid
8154875db 5 rel 18188  
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154875, lsn: 0/3AFFE2A0, prev 0/3AFFE270, desc: LOCK xid
8154875db 5 rel 18191  
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154875, lsn: 0/3AFFE2D0, prev 0/3AFFE2A0, desc: LOCK xid
8154875db 5 rel 18192  
rmgr: Transaction len (rec/tot):     62/    62, tx:    8154875, lsn: 0/3AFFE300, prev 0/3AFFE2D0, desc: INVALIDATION ;
invalmsgs: relcache 18191 relcache 18192 
rmgr: Transaction len (rec/tot):     82/    82, tx:    8154875, lsn: 0/3AFFE340, prev 0/3AFFE300, desc: COMMIT
2024-07-1612:56:42.878147 CEST; inval msgs: relcache 18191 relcache 18192 
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154876, lsn: 0/3AFFE398, prev 0/3AFFE340, desc: LOCK xid
8154876db 5 rel 18188  
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154876, lsn: 0/3AFFE3C8, prev 0/3AFFE398, desc: LOCK xid
8154876db 5 rel 18191  
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154876, lsn: 0/3AFFE3F8, prev 0/3AFFE3C8, desc: LOCK xid
8154876db 5 rel 18192  
rmgr: Transaction len (rec/tot):     62/    62, tx:    8154876, lsn: 0/3AFFE428, prev 0/3AFFE3F8, desc: INVALIDATION ;
invalmsgs: relcache 18191 relcache 18192 
rmgr: Transaction len (rec/tot):     82/    82, tx:    8154876, lsn: 0/3AFFE468, prev 0/3AFFE428, desc: COMMIT
2024-07-1612:56:42.878544 CEST; inval msgs: relcache 18191 relcache 18192 

Have people observed this behavior before? Would the community be
open to accepting a patch that changes the behavior of ON COMMIT ROWS
to not WAL log the lock taken and/or the relcache inval?
I think:
* An exclusive lock needs to be taken on primary, but does not
need to be WAL logged
* Rel cache invalidation should not be necessary I think (it currently
happens just on the TOAST table+index, not on the regular TEMP table)

-Floris




pgsql-hackers by date:

Previous
From: Yasir
Date:
Subject: ActiveState Perl is not valid anymore to build PG17 on the Windows 10/11 platforms, So Documentation still suggesting it should be updated
Next
From: Yasir Shah
Date:
Subject: Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows