Thread: temp table on commit delete rows performance issue
Hi hackers,

I'm looking for some input on an issue I've observed. A common pattern
I've seen is using temporary tables to put data in before updating the
real tables. Something roughly like:

On session start:
CREATE TEMP TABLE temp_t1 (...) ON COMMIT DELETE ROWS;

On update:
BEGIN;
COPY temp_t1 FROM STDIN (FORMAT BINARY);
INSERT INTO t1 (SELECT * FROM temp_t1 ...) ON CONFLICT DO UPDATE SET ...;
-- potentially some other operations on temp table to put data into real table t1
COMMIT;

This pattern starts to break down under certain exceptional circumstances of
high concurrency. The "ON COMMIT DELETE ROWS" does a truncate that is
fairly expensive and doesn't work well in high-concurrency scenarios. It's
especially noticeable under the following circumstances:
- high max_connections setting
- high number of temp tables per session
- concurrent writers at fairly short intervals

The impact is on TPS on the primary, and the WAL replay process on the
replica becomes completely overloaded (100% CPU even though not a lot of
WAL is being generated).

A very simple pgbench example that showcases the degradation (taken with
max_connections=2000 to clearly show it):

test_truncate.sql
begin;
set client_min_messages=warning;
create temp table if not exists ut0 (a int, b text, c text) on commit delete rows;
commit;

test_no_truncate.sql
begin;
set client_min_messages=warning;
create temp table if not exists ut0 (a int, b text, c text); -- do not do anything on commit
commit;

pgbench -n -f test_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 2693.806376 (without initial connection time)

pgbench -n -f test_truncate.sql -T 10 postgres -c 10 -j 10
pgbench (18devel)
tps = 12119.358458 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 22685.877281 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 10 -j 10
pgbench (18devel)
tps = 200359.458154 (without initial connection time)

For test_truncate.sql with 10 threads, the WAL replay process is spinning at
100% CPU. The reason seems to be that, for a large part, the 'on commit delete
rows' follows the same code path as a regular truncate. This means:
* taking an exclusive lock on the temp table (WAL logged)
* invalidating the rel cache (WAL logged)

On the replica, these exclusive locks are very expensive to replay, leading the
replica to spend most of its cycles trying to get LWLocks on the non-fast path.
On the primary, I believe the degradation under concurrency comes from the rel
cache invalidation. There doesn't seem to be a good reason for this behavior:
the replica has nothing to do for a temporary table truncation, so this lock
doesn't need to be WAL logged. The generated WAL consists entirely of
LOCK + INVALIDATION + COMMIT records, and the replay process spins at 100% CPU
trying to replay them:
rmgr: Standby      len (rec/tot):  42/  42, tx: 8154875, lsn: 0/3AFFE270, prev 0/3AFFE218, desc: LOCK xid 8154875 db 5 rel 18188
rmgr: Standby      len (rec/tot):  42/  42, tx: 8154875, lsn: 0/3AFFE2A0, prev 0/3AFFE270, desc: LOCK xid 8154875 db 5 rel 18191
rmgr: Standby      len (rec/tot):  42/  42, tx: 8154875, lsn: 0/3AFFE2D0, prev 0/3AFFE2A0, desc: LOCK xid 8154875 db 5 rel 18192
rmgr: Transaction  len (rec/tot):  62/  62, tx: 8154875, lsn: 0/3AFFE300, prev 0/3AFFE2D0, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction  len (rec/tot):  82/  82, tx: 8154875, lsn: 0/3AFFE340, prev 0/3AFFE300, desc: COMMIT 2024-07-16 12:56:42.878147 CEST; inval msgs: relcache 18191 relcache 18192
rmgr: Standby      len (rec/tot):  42/  42, tx: 8154876, lsn: 0/3AFFE398, prev 0/3AFFE340, desc: LOCK xid 8154876 db 5 rel 18188
rmgr: Standby      len (rec/tot):  42/  42, tx: 8154876, lsn: 0/3AFFE3C8, prev 0/3AFFE398, desc: LOCK xid 8154876 db 5 rel 18191
rmgr: Standby      len (rec/tot):  42/  42, tx: 8154876, lsn: 0/3AFFE3F8, prev 0/3AFFE3C8, desc: LOCK xid 8154876 db 5 rel 18192
rmgr: Transaction  len (rec/tot):  62/  62, tx: 8154876, lsn: 0/3AFFE428, prev 0/3AFFE3F8, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction  len (rec/tot):  82/  82, tx: 8154876, lsn: 0/3AFFE468, prev 0/3AFFE428, desc: COMMIT 2024-07-16 12:56:42.878544 CEST; inval msgs: relcache 18191 relcache 18192

Have people observed this behavior before? Would the community be open to
accepting a patch that changes the behavior of ON COMMIT DELETE ROWS to not
WAL log the lock taken and/or the relcache invalidation?

I think:
* An exclusive lock needs to be taken on the primary, but does not need to be WAL logged
* The relcache invalidation should not be necessary (it currently happens only on the TOAST table + index, not on the regular TEMP table)

-Floris
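For reference, the per-commit WAL overhead described above can also be observed without pg_waldump. A minimal psql sketch, assuming a session in which the ut0 table from test_truncate.sql already exists (the variable name is arbitrary):

```sql
-- capture the current WAL position, run one transaction that only touches the
-- temp table, then see how many WAL bytes the ON COMMIT DELETE ROWS commit produced
SELECT pg_current_wal_lsn() AS lsn_before \gset
BEGIN;
INSERT INTO ut0 VALUES (1, 'x', 'y');
COMMIT;   -- the on-commit truncate of ut0 happens here
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before') AS wal_bytes_for_one_commit;
```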
Hi,

> I'm looking for some input on an issue I've observed. A common pattern
> I've seen is using temporary tables to put data in before updating the
> real tables. Something roughly like:
>
> On session start:
> CREATE TEMP TABLE temp_t1 (...) ON COMMIT DELETE ROWS;
>
> On update:
> BEGIN;
> COPY temp_t1 FROM STDIN (FORMAT BINARY);
> INSERT INTO t1 (SELECT * FROM temp_t1 ...) ON CONFLICT DO UPDATE SET ...;
> -- potentially some other operations on temp table to put data into real table t1
> COMMIT;
>
> This pattern starts to break down under certain exceptional circumstances of
> high concurrency. The "ON COMMIT DELETE ROWS" does a truncate that is
> fairly expensive and doesn't work well in high-concurrency scenarios. It's
> especially noticeable under the following circumstances:
> - high max_connections setting
> - high number of temp tables per session
> - concurrent writers at fairly short intervals
>
> The impact is on TPS on the primary, and the WAL replay process on the
> replica becomes completely overloaded (100% CPU even though not a lot of
> WAL is being generated).
>
> [...]

I didn't investigate your particular issue, but generally speaking creating a
table, even a temporary one, is an expensive operation.

Note that it's far from being just a separate file on disk. It affects catalog
tables, shared buffers, all the corresponding locks, etc. If you have indexes
on a temporary table it makes the situation even worse. Sooner or later
VACUUM will happen for your bloated catalog, and this is not fun under
heavy load.

Is there any particular reason why you don't want to simply change the target
table directly? If you do it in a transaction you are safe.

--
Best regards,
Aleksander Alekseev
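For reference, a minimal sketch of the direct approach suggested here; the target table t1, its columns, and the conflict target are made up for illustration:

```sql
-- hypothetical target table; the real t1 from the original post is not shown
CREATE TABLE IF NOT EXISTS t1 (a int PRIMARY KEY, b text, c text);

-- write straight to the target table inside one transaction, no temp table involved
BEGIN;
INSERT INTO t1 (a, b, c)
VALUES (1, 'foo', 'bar'),
       (2, 'baz', 'qux')
ON CONFLICT (a) DO UPDATE
    SET b = EXCLUDED.b,
        c = EXCLUDED.c;
COMMIT;
```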
> I didn't investigate your particular issue, but generally speaking creating a
> table, even a temporary one, is an expensive operation.
>
> Note that it's far from being just a separate file on disk. It affects catalog
> tables, shared buffers, all the corresponding locks, etc. If you have indexes
> on a temporary table it makes the situation even worse. Sooner or later
> VACUUM will happen for your bloated catalog, and this is not fun under
> heavy load.

Yes, creation is expensive, but note that this test case does not create the
temp table for every transaction. It creates it once on startup of the
connection, so there will be no catalog bloat, as the connections are generally
long-lived. The part that remains expensive (and which was unexpected to me)
is the truncate, which does happen for every transaction.

> Is there any particular reason why you don't want to simply change the target
> table directly? If you do it in a transaction you are safe.

This is a fair point, which I didn't explain in my earlier message. We find that
for various use cases the temp table approach is a really convenient way of
interacting with Postgres. You can easily binary COPY whatever data you want
into it, without worrying too much about data size, and then handle it from
there on the db side. Doing the same without a temp table, you'd need to resort
to passing the data as query parameters, which has complications like a limit
on the number of bound parameters per query. It's definitely possible to use as
well, but the binary COPY into a temp table is quite convenient.

-Floris
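To illustrate the bound-parameter limitation mentioned here: in the extended query protocol, every row value is a separate bound parameter, and a single Bind message allows at most 65535 parameters, so large batches have to be split across statements. A sketch using the same hypothetical t1 as above; the prepared-statement form is only for illustration, a client driver would normally bind the parameters over the wire:

```sql
-- with 3 columns, each row adds 3 parameters, so one statement tops out
-- around 21845 rows before hitting the 65535-parameter protocol limit
PREPARE upsert_two_rows (int, text, text, int, text, text) AS
INSERT INTO t1 (a, b, c)
VALUES ($1, $2, $3),
       ($4, $5, $6)
ON CONFLICT (a) DO UPDATE
    SET b = EXCLUDED.b,
        c = EXCLUDED.c;

EXECUTE upsert_two_rows(1, 'foo', 'bar', 2, 'baz', 'qux');
```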
On Jul 16, 2024, at 19:47, Floris Van Nee <florisvannee@Optiver.com> wrote:

> Hi hackers,
>
> I'm looking for some input on an issue I've observed. A common pattern
> I've seen is using temporary tables to put data in before updating the
> real tables.
>
> [...]

I also encountered a similar performance issue with temporary tables and
provided a patch to optimize the truncate performance during commit in [1].

Fei Changhong
> I also encountered a similar performance issue with temporary tables and
> provided a patch to optimize the truncate performance during commit in [1].

Interesting, that is definitely another good way to improve the performance,
especially with a large number of temp tables. I think the two optimizations
can actually work well together: your optimization truncates only the tables
that were actually used, combined with a patch like the attached one, which
makes sure that no WAL is generated at all for the ON COMMIT DELETE ROWS
operation.

On my test system this reduces the WAL generated for the pgbench test case I
posted previously to zero (and therefore brought the WAL replay process on the
replica from 100% CPU and lagging behind down to 0% CPU).

-Floris
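As a generic way to sanity-check that kind of claim on one's own setup (not part of the patch, just standard monitoring): sample pg_stat_wal (PostgreSQL 14 and later) before and after a pgbench run of test_truncate.sql and compare the deltas, e.g.:

```sql
-- run once before and once after the pgbench run; the difference in
-- wal_records / wal_bytes shows how much WAL those transactions generated
SELECT wal_records, wal_fpi, wal_bytes, stats_reset
FROM pg_stat_wal;
```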
On Jul 18, 2024, at 21:36, Floris Van Nee <florisvannee@Optiver.com> wrote:

>> I also encountered a similar performance issue with temporary tables and
>> provided a patch to optimize the truncate performance during commit in [1].
>
> Interesting, that is definitely another good way to improve the performance,
> especially with a large number of temp tables. I think the two optimizations
> can actually work well together: your optimization truncates only the tables
> that were actually used, combined with a patch like the attached one, which
> makes sure that no WAL is generated at all for the ON COMMIT DELETE ROWS
> operation.
It seems that in your patch, WAL logging is skipped for all tables, not just
temporary tables.
Upon further consideration, do we really need to acquire AccessExclusiveLocks
for temporary tables? Since temporary tables can only be accessed within the
current session, perhaps we can make the following optimizations:
```
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 00074c8a94..845c9603e2 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -2977,9 +2977,17 @@ RelationTruncateIndexes(Relation heapRelation)
 		Oid			indexId = lfirst_oid(indlist);
 		Relation	currentIndex;
 		IndexInfo  *indexInfo;
+		LOCKMODE	lockmode;
 
-		/* Open the index relation; use exclusive lock, just to be sure */
-		currentIndex = index_open(indexId, AccessExclusiveLock);
+		/*
+		 * Open the index relation; use exclusive lock, just to be sure.
+		 * AccessExclusiveLock is not necessary for temporary tables.
+		 */
+		if (heapRelation->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
+			lockmode = AccessExclusiveLock;
+		else
+			lockmode = ExclusiveLock;
+		currentIndex = index_open(indexId, lockmode);
 
 		/*
 		 * Fetch info needed for index_build.  Since we know there are no
@@ -3026,7 +3034,9 @@ heap_truncate(List *relids)
 		Oid			rid = lfirst_oid(cell);
 		Relation	rel;
 
-		rel = table_open(rid, AccessExclusiveLock);
+		/* AccessExclusiveLock is not necessary for temporary tables. */
+		rel = table_open(rid, ExclusiveLock);
+		Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP);
 		relations = lappend(relations, rel);
 	}
 
@@ -3059,6 +3069,7 @@ void
 heap_truncate_one_rel(Relation rel)
 {
 	Oid			toastrelid;
+	LOCKMODE	lockmode;
 
 	/*
 	 * Truncate the relation.  Partitioned tables have no storage, so there is
@@ -3073,11 +3084,17 @@ heap_truncate_one_rel(Relation rel)
 	/* If the relation has indexes, truncate the indexes too */
 	RelationTruncateIndexes(rel);
 
+	/* AccessExclusiveLock is not necessary for temporary tables. */
+	if (rel->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
+		lockmode = AccessExclusiveLock;
+	else
+		lockmode = ExclusiveLock;
+
 	/* If there is a toast table, truncate that too */
 	toastrelid = rel->rd_rel->reltoastrelid;
 	if (OidIsValid(toastrelid))
 	{
-		Relation	toastrel = table_open(toastrelid, AccessExclusiveLock);
+		Relation	toastrel = table_open(toastrelid, lockmode);
 
 		table_relation_nontransactional_truncate(toastrel);
 		RelationTruncateIndexes(toastrel);
```
Fei Changhong
> It seems that in your patch, WAL logging is skipped for all tables, not just
> temporary tables.

This code path is only used in two cases though:
* for temporary tables with ON COMMIT DELETE ROWS
* for truncating tables that were created in the same transaction, or that were
  already truncated earlier in the same transaction (this is a special case in
  the TRUNCATE command)

In both cases I believe it's not necessary to log the lock, as the table either
doesn't exist on the replica yet, or the exclusive lock has already been
obtained and logged previously. Regular TRUNCATE commands go through a
completely different code path, as these need to be rollbackable if the
transaction aborts.

> Upon further consideration, do we really need to acquire AccessExclusiveLocks
> for temporary tables? Since temporary tables can only be accessed within the
> current session, perhaps we can make the following optimizations:

This one I'm less sure is correct in all cases. Logically it makes sense that no
other backends can access the table, but I see some threads [1] suggesting that
it's technically possible for other backends to take locks on these tables, so
it's not *that* obvious there are no edge cases.

[1] https://postgrespro.com/list/thread-id/2477885
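For what it's worth, whether any other backend currently holds or is waiting for a lock on a given temp table can be checked from any session via pg_locks. A sketch, where pg_temp_3.ut0 is just an illustrative name (the temp schema number depends on the owning backend):

```sql
-- list backends holding or awaiting locks on this particular temp table
SELECT l.pid, l.mode, l.granted
FROM pg_locks l
WHERE l.locktype = 'relation'
  AND l.relation = 'pg_temp_3.ut0'::regclass;
```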