Thread: temp table on commit delete rows performance issue

temp table on commit delete rows performance issue

From: Floris Van Nee
Hi hackers,

I'm looking for some input on an issue I've observed. A common pattern
I've seen is using temporary tables to put data in before updating the
real tables. Something roughly like:

On session start:
CREATE TEMP TABLE temp_t1 (...) ON COMMIT DELETE ROWS;

On update:
BEGIN;
COPY temp_t1 FROM STDIN (FORMAT BINARY);
INSERT INTO t1 (SELECT * FROM temp_t1 ...) ON CONFLICT DO UPDATE SET ...;
-- potentially some other operations on temp table to put data into real table t1
COMMIT;

This pattern starts to break down under high concurrency. ON COMMIT
DELETE ROWS performs a truncate at every commit, which is fairly expensive
and doesn't work well in high-concurrency scenarios. It's especially
noticeable under the following circumstances:
- high max_connections setting
- high number of temp tables per session
- concurrent writers at fairly short intervals
The impact shows up both as reduced TPS on the primary and as the WAL
replay process on the replica becoming completely overloaded (100% CPU
even though not a lot of WAL is being generated).

Here is a very simple pgbench example that showcases the degradation
(run with max_connections=2000 to show it clearly).

test_truncate.sql

begin;
set client_min_messages=warning;
create temp table if not exists ut0 (a int, b text, c text) on commit delete rows;
commit;

test_no_truncate.sql

begin;
set client_min_messages=warning;
create temp table if not exists ut0 (a int, b text, c text); -- do not do anything on commit
commit;

pgbench -n -f test_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 2693.806376 (without initial connection time)

pgbench -n -f test_truncate.sql -T 10 postgres -c 10 -j 10
pgbench (18devel)
tps = 12119.358458 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 22685.877281 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 10 -j 10
pgbench (18devel)
tps = 200359.458154 (without initial connection time)

For the test_truncate.sql run with 10 clients, the WAL replay process on the replica spins at 100% CPU.

The reason seems to be that, for a large part, the ON COMMIT DELETE ROWS
processing follows the same code path as a regular TRUNCATE.
This means:
* taking an exclusive lock on the temp table (WAL logged)
* invalidating the relcache (WAL logged)

On the replica, these exclusive locks are very expensive to replay, so the
replay process ends up spending most of its cycles trying to acquire LWLocks
outside the fast path. On the primary, I believe the degradation under
concurrency comes from the relcache invalidation.

There doesn't seem to be a good reason for this behavior: the replica
has nothing to do for a temporary table truncation, so this lock doesn't
need to be WAL logged. The generated WAL consists entirely of
LOCK + INVALIDATION + COMMIT records, and the replay process spins at
100% CPU just trying to replay them:

rmgr: Standby     len (rec/tot):     42/    42, tx:    8154875, lsn: 0/3AFFE270, prev 0/3AFFE218, desc: LOCK xid 8154875 db 5 rel 18188
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154875, lsn: 0/3AFFE2A0, prev 0/3AFFE270, desc: LOCK xid 8154875 db 5 rel 18191
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154875, lsn: 0/3AFFE2D0, prev 0/3AFFE2A0, desc: LOCK xid 8154875 db 5 rel 18192
rmgr: Transaction len (rec/tot):     62/    62, tx:    8154875, lsn: 0/3AFFE300, prev 0/3AFFE2D0, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction len (rec/tot):     82/    82, tx:    8154875, lsn: 0/3AFFE340, prev 0/3AFFE300, desc: COMMIT 2024-07-16 12:56:42.878147 CEST; inval msgs: relcache 18191 relcache 18192
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154876, lsn: 0/3AFFE398, prev 0/3AFFE340, desc: LOCK xid 8154876 db 5 rel 18188
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154876, lsn: 0/3AFFE3C8, prev 0/3AFFE398, desc: LOCK xid 8154876 db 5 rel 18191
rmgr: Standby     len (rec/tot):     42/    42, tx:    8154876, lsn: 0/3AFFE3F8, prev 0/3AFFE3C8, desc: LOCK xid 8154876 db 5 rel 18192
rmgr: Transaction len (rec/tot):     62/    62, tx:    8154876, lsn: 0/3AFFE428, prev 0/3AFFE3F8, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction len (rec/tot):     82/    82, tx:    8154876, lsn: 0/3AFFE468, prev 0/3AFFE428, desc: COMMIT 2024-07-16 12:56:42.878544 CEST; inval msgs: relcache 18191 relcache 18192

Have people observed this behavior before? Would the community be
open to accepting a patch that changes the behavior of ON COMMIT DELETE
ROWS so that it doesn't WAL-log the lock taken and/or the relcache inval?
I think:
* An exclusive lock still needs to be taken on the primary, but it does not
need to be WAL logged
* The relcache invalidation should not be necessary (it currently
happens only for the TOAST table and its index, not for the regular temp table)

-Floris




Re: temp table on commit delete rows performance issue

From: Aleksander Alekseev
Hi,

> I'm looking for some input on an issue I've observed. A common pattern
> I've seen is using temporary tables to put data in before updating the
> real tables. Something roughly like:
>
> On session start:
> CREATE TEMP TABLE temp_t1 (...) ON COMMIT DELETE ROWS;
>
> On update:
> BEGIN;
> COPY temp_t1 FROM STDIN (FORMAT BINARY);
> INSERT INTO t1 (SELECT * FROM temp_t1 ...) ON CONFLICT DO UPDATE SET ...;
> -- potentially some other operations on temp table to put data into real table t1
> COMMIT;
>
> This pattern starts to break down under high concurrency. ON COMMIT
> DELETE ROWS performs a truncate at every commit, which is fairly expensive
> and doesn't work well in high-concurrency scenarios. It's especially
> noticeable under the following circumstances:
> - high max_connections setting
> - high number of temp tables per session
> - concurrent writers at fairly short intervals
> The impact shows up both as reduced TPS on the primary and as the WAL
> replay process on the replica becoming completely overloaded (100% CPU
> even though not a lot of WAL is being generated)
>
> [...]

I didn't investigate your particular issue, but generally speaking
creating a table, even a temporary one, is an expensive operation.

Note that it's far more than just a separate file on disk. It affects
catalog tables, shared buffers, all the corresponding locks, etc. If
you have indexes on a temporary table, it makes the situation even
worse. Sooner or later VACUUM will happen on your bloated catalog,
and this is not fun under heavy load.
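To make that concrete, here is a rough illustration (the table name here is
made up) of the catalog footprint of even a trivial temp table:

-- Creating a temp table inserts rows into pg_class, pg_type,
-- pg_attribute, pg_depend, etc. For instance, the 3 user columns plus
-- the 6 system columns all get rows in pg_attribute:
CREATE TEMP TABLE catalog_demo (a int, b text, c text);

SELECT count(*) FROM pg_attribute
WHERE attrelid = 'catalog_demo'::regclass;  -- returns 9

All of those catalog rows become dead tuples once the table is dropped,
and VACUUM has to clean them up.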

Is there any particular reason why you don't want to simply change the
target table directly? If you do it in a transaction you are safe.
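For example, the direct version (a sketch, assuming a hypothetical
t1(a int, b text) with a unique key on a) would be simply:

BEGIN;
-- rows go straight into the real table; no temp table, no
-- commit-time truncate
INSERT INTO t1 (a, b)
VALUES (1, 'x'), (2, 'y')
ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b;
COMMIT;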

-- 
Best regards,
Aleksander Alekseev



RE: temp table on commit delete rows performance issue

From: Floris Van Nee
> 
> I didn't investigate your particular issue, but generally speaking creating a
> table, even a temporary one, is an expensive operation.
> 
> Note that it's far more than just a separate file on disk. It affects catalog
> tables, shared buffers, all the corresponding locks, etc. If you have indexes
> on a temporary table, it makes the situation even worse. Sooner or later
> VACUUM will happen on your bloated catalog, and this is not fun under
> heavy load.

Yes, creation is expensive, but note that this test case does not create
the temp table for every transaction. It creates it once, on startup of the
connection. So there will be no catalog bloat, as the connections are
generally long-lived. The part that remains expensive (and which was
unexpected to me) is the truncate that does happen for every transaction.

> 
> Is there any particular reason why you don't want to simply change the target
> table directly? If you do it in a transaction you are safe.
> 

This is a fair point, which I didn't explain in my earlier message. We find
that for various use cases the temp table approach is a really convenient way
of interacting with Postgres. You can easily binary-COPY whatever data you
want into it without worrying too much about the data size, and handle it
from there on the database side. Without this, you'd need to resort to
passing the data as query parameters, which has complications such as the
limit on the number of bound parameters per query. That approach is
definitely usable as well, but the binary COPY into a temp table is quite
convenient.
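To illustrate the kind of thing we'd have to write instead (a sketch,
assuming a hypothetical t1(a int, b text) with a unique key on a): shipping
the rows as arrays and unnesting them server-side avoids the per-query
bound-parameter limit, but the arrays still have to be built client-side:

-- only two bound parameters, regardless of the number of rows
INSERT INTO t1 (a, b)
SELECT u.a, u.b
FROM unnest($1::int[], $2::text[]) AS u(a, b)
ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b;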

-Floris


Re: temp table on commit delete rows performance issue

From: feichanghong
Hi Floris,

On Jul 16, 2024, at 19:47, Floris Van Nee <florisvannee@Optiver.com> wrote:

> Hi hackers,
>
> I'm looking for some input on an issue I've observed. A common pattern
> I've seen is using temporary tables to put data in before updating the
> real tables. Something roughly like:
>
> On session start:
> CREATE TEMP TABLE temp_t1 (...) ON COMMIT DELETE ROWS;
>
> On update:
> BEGIN;
> COPY temp_t1 FROM STDIN (FORMAT BINARY);
> INSERT INTO t1 (SELECT * FROM temp_t1 ...) ON CONFLICT DO UPDATE SET ...;
> -- potentially some other operations on temp table to put data into real table t1
> COMMIT;
>
> This pattern starts to break down under high concurrency. ON COMMIT
> DELETE ROWS performs a truncate at every commit, which is fairly expensive
> and doesn't work well in high-concurrency scenarios. It's especially
> noticeable under the following circumstances:
> - high max_connections setting
> - high number of temp tables per session
> - concurrent writers at fairly short intervals
> The impact shows up both as reduced TPS on the primary and as the WAL
> replay process on the replica becoming completely overloaded (100% CPU
> even though not a lot of WAL is being generated)
>
> Here is a very simple pgbench example that showcases the degradation
> (run with max_connections=2000 to show it clearly).

I also encountered a similar performance issue with temporary tables
and provided a patch to optimize the truncate performance during commit
in [1].

Additionally, is it possible to lower the lock level held during truncate for
temporary tables?

[1] https://www.postgresql.org/message-id/flat/tencent_924E990F0493010E2C8404A5D677C70C9707%40qq.com

Best Regards,
Fei Changhong

RE: temp table on commit delete rows performance issue

From: Floris Van Nee
> I also encountered a similar performance issue with temporary tables
> and provided a patch to optimize the truncate performance during commit
> in [1].

Interesting, that is definitely another good way to improve the performance,
especially with a large number of temp tables. I think the two optimizations
can actually work well together: your optimization truncates only the tables
that were actually used, combined with a patch like the attached one, which
makes sure that no WAL is generated at all for the ON COMMIT DELETE ROWS
operation.

On my test system this reduces WAL generation for the pgbench test case
I posted previously to zero (and therefore brought the WAL replay process
from 100% CPU and lagging behind down to 0% CPU usage).
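For anyone who wants to reproduce the measurement, a rough psql sketch
(using the core functions pg_current_wal_lsn() and pg_wal_lsn_diff()) is:

SELECT pg_current_wal_lsn() AS lsn_before \gset
BEGIN;
-- run one iteration of the workload of interest here
COMMIT;
-- bytes of WAL written since the first call; note this measures all WAL
-- on the server in the interval, so run it on an otherwise idle instance
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before') AS wal_bytes;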

-Floris


Attachment

Re: temp table on commit delete rows performance issue

From: feichanghong
Hi Floris,

On Jul 18, 2024, at 21:36, Floris Van Nee <florisvannee@Optiver.com> wrote:


> > I also encountered a similar performance issue with temporary tables
> > and provided a patch to optimize the truncate performance during commit
> > in [1].
>
> Interesting, that is definitely another good way to improve the performance,
> especially with a large number of temp tables. I think the two optimizations
> can actually work well together: your optimization truncates only the tables
> that were actually used, combined with a patch like the attached one, which
> makes sure that no WAL is generated at all for the ON COMMIT DELETE ROWS
> operation.

It seems that in your patch, WAL logging is skipped for all tables, not just
temporary tables.

Upon further consideration, do we really need to acquire AccessExclusiveLocks
for temporary tables? Since temporary tables can only be accessed within the
current session, perhaps we can make the following optimizations:

```
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 00074c8a94..845c9603e2 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -2977,9 +2977,17 @@ RelationTruncateIndexes(Relation heapRelation)
 		Oid			indexId = lfirst_oid(indlist);
 		Relation	currentIndex;
 		IndexInfo  *indexInfo;
+		LOCKMODE	lockmode;
 
-		/* Open the index relation; use exclusive lock, just to be sure */
-		currentIndex = index_open(indexId, AccessExclusiveLock);
+		/*
+		 * Open the index relation; use exclusive lock, just to be sure.
+		 * AccessExclusiveLock is not necessary for temporary tables.
+		 */
+		if (heapRelation->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
+			lockmode = AccessExclusiveLock;
+		else
+			lockmode = ExclusiveLock;
+		currentIndex = index_open(indexId, lockmode);
 
 		/*
 		 * Fetch info needed for index_build.  Since we know there are no
@@ -3026,7 +3034,9 @@ heap_truncate(List *relids)
 		Oid			rid = lfirst_oid(cell);
 		Relation	rel;
 
-		rel = table_open(rid, AccessExclusiveLock);
+		/* AccessExclusiveLock is not necessary for temporary tables. */
+		rel = table_open(rid, ExclusiveLock);
+		Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP);
 		relations = lappend(relations, rel);
 	}
 
@@ -3059,6 +3069,7 @@ void
 heap_truncate_one_rel(Relation rel)
 {
 	Oid			toastrelid;
+	LOCKMODE	lockmode;
 
 	/*
 	 * Truncate the relation.  Partitioned tables have no storage, so there is
@@ -3073,11 +3084,17 @@ heap_truncate_one_rel(Relation rel)
 	/* If the relation has indexes, truncate the indexes too */
 	RelationTruncateIndexes(rel);
 
+	/* AccessExclusiveLock is not necessary for temporary tables. */
+	if (rel->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
+		lockmode = AccessExclusiveLock;
+	else
+		lockmode = ExclusiveLock;
+
 	/* If there is a toast table, truncate that too */
 	toastrelid = rel->rd_rel->reltoastrelid;
 	if (OidIsValid(toastrelid))
 	{
-		Relation	toastrel = table_open(toastrelid, AccessExclusiveLock);
+		Relation	toastrel = table_open(toastrelid, lockmode);
 
 		table_relation_nontransactional_truncate(toastrel);
 		RelationTruncateIndexes(toastrel);
```


Best Regards,
Fei Changhong

RE: temp table on commit delete rows performance issue

From: Floris Van Nee
> It seems that in your patch, WAL logging is skipped for all tables, not just
> temporary tables.

This code path is only used in two cases though:
* for temporary tables with ON COMMIT DELETE ROWS
* for truncating tables that were created, or already truncated, earlier in
the same transaction (this is a special case in the TRUNCATE command; see
the sketch below)
In both cases I believe it's not necessary to log the lock: either the table
doesn't exist on the replica yet, or the exclusive lock has already been
obtained and logged previously. Regular TRUNCATE commands go through a
completely different code path, as they need to be rollbackable if the
transaction aborts.
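The second case can be reproduced with something like this sketch (the table
name is made up):

BEGIN;
CREATE TABLE t_new (a int);
-- t_new is invisible to other sessions and to the replica at this point,
-- so TRUNCATE can take the immediate, non-rollbackable truncation path
TRUNCATE t_new;
COMMIT;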

> Upon further consideration, do we really need to acquire AccessExclusiveLocks
> for temporary tables? Since temporary tables can only be accessed within the
> current session, perhaps we can make the following optimizations:

I'm less sure that this one is correct in all cases. Logically it makes sense
that no other backends can access the table; however, I've seen threads [1]
suggesting that it's technically possible for other backends to take locks on
these tables, so it's not *that* obvious that there are no edge cases.

[1] https://postgrespro.com/list/thread-id/2477885