Thread: subtransaction performance regression [kind of] due to snapshot caching
Hi,

In a recent thread ([1]) I found a performance regression of the following
statement:

DO $do$
BEGIN
    FOR i IN 1 .. 10000 LOOP
        BEGIN
            EXECUTE $cf$CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $f$BEGIN frakbar; END;$f$;$cf$;
        EXCEPTION WHEN others THEN
        END;
    END LOOP;
END;$do$;

13:     1617.798
14-dev: 34088.505

The time in 14 is spent mostly below:

-   94.58%     0.01%  postgres  postgres  [.] CreateFunction
   - 94.57% CreateFunction
      - 94.49% ProcedureCreate
         - 90.95% record_object_address_dependencies
            - 90.93% recordMultipleDependencies
               - 89.65% isObjectPinned
                  - 89.12% systable_getnext
                     - 89.06% index_getnext_slot
                        - 56.13% index_fetch_heap
                           - 54.82% table_index_fetch_tuple
                              + 53.79% heapam_index_fetch_tuple
                                0.07% heap_hot_search_buffer
                          0.01% ReleaseAndReadBuffer
                          0.01% LockBuffer
                          0.08% heapam_index_fetch_tuple

After a bit of debugging I figured out that the direct failure lies with
623a9ba79b. The problem is that subtransaction abort does not increment
ShmemVariableCache->xactCompletionCount. That's trivial to remedy (we
already lock ProcArrayLock during XidCacheRemoveRunningXids).

What happens is that heap_hot_search_buffer() correctly recognizes the
aborted subtransaction's rows as dead, but they are not recognized as
"surely dead". That then leads to O(iterations^2) index->heap lookups,
because the rows from previous iterations are never recognized as dead.

I initially was a bit worried that this could be a correctness issue as
well. The more I thought about it, the more confused I got. A
transaction's subtransaction abort should not actually change the
contents of a snapshot, right?

Snapshot
GetSnapshotData(Snapshot snapshot)
...
            /*
             * We don't include our own XIDs (if any) in the snapshot. It
             * needs to be included in the xmin computation, but we did so
             * outside the loop.
             */
            if (pgxactoff == mypgxactoff)
                continue;

The sole reason for the behavioural difference is that the cached
snapshot's xmax is *lower* than a new snapshot's would be, because
GetSnapshotData() initializes xmax from
ShmemVariableCache->latestCompletedXid - which XidCacheRemoveRunningXids()
incremented, without incrementing ShmemVariableCache->xactCompletionCount.

Which then causes HeapTupleSatisfiesMVCC to go down

    if (!HeapTupleHeaderXminCommitted(tuple))
    ...
        else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
            return false;
        else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
            SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                        HeapTupleHeaderGetRawXmin(tuple));
        else
        {
            /* it must have aborted or crashed */
            SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                        InvalidTransactionId);
            return false;
        }

the "return false" for XidInMVCCSnapshot() rather than the
SetHintBits(HEAP_XMIN_INVALID) path. Which then in turn causes
HeapTupleIsSurelyDead() to not recognize the rows as surely dead.

bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
    uint32      i;

    /*
     * Make a quick range check to eliminate most XIDs without looking at the
     * xip arrays.  Note that this is OK even if we convert a subxact XID to
     * its parent below, because a subxact with XID < xmin has surely also
     * got a parent with XID < xmin, while one with XID >= xmax must belong
     * to a parent that was not yet committed at the time of this snapshot.
     */

    /* Any xid < xmin is not in-progress */
    if (TransactionIdPrecedes(xid, snapshot->xmin))
        return false;
    /* Any xid >= xmax is in-progress */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return true;

I *think* this issue doesn't lead to actually wrong query results.
For HeapTupleSatisfiesMVCC purposes there's not much of a difference
between an aborted transaction and one that's "in progress" according to
the snapshot (that's required - we don't check for aborts for xids in the
snapshot).

It is a bit disappointing that there are - as far as I could find - no
tests for kill_prior_tuple actually working. I guess that lack, plus the
fact that there's no difference in query results, explains why nobody
noticed the issue in the last ~9 months.

See the attached fix. I did include a test that verifies that the
kill_prior_tuples optimization actually prevents the index from growing
when subtransactions are involved. I think it should be stable, even with
concurrent activity. But I'd welcome a look.

I don't think that's why the issue exists, but I very much hate the
XidCache* name. It makes it sound much less important than it is...

Greetings,

Andres Freund

[1] https://postgr.es/m/20210317055718.v6qs3ltzrformqoa%40alap3.anarazel.de
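The remedy described above - making subtransaction abort bump
ShmemVariableCache->xactCompletionCount under the ProcArrayLock that
XidCacheRemoveRunningXids() already takes - might look roughly like the
sketch below. This is an illustrative fragment, not the committed patch;
the exact placement inside XidCacheRemoveRunningXids() is an assumption.

    /* Sketch only: inside XidCacheRemoveRunningXids() */
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

    /*
     * ... existing code removes the aborted subxids from the subxid cache
     * and advances ShmemVariableCache->latestCompletedXid here ...
     */

    /*
     * Added: also bump xactCompletionCount, so cached snapshots are not
     * reused across the subtransaction abort and the aborted rows can be
     * recognized as surely dead (enabling kill_prior_tuple).
     */
    ShmemVariableCache->xactCompletionCount++;

    LWLockRelease(ProcArrayLock);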
Andres Freund <andres@anarazel.de> writes:
> The time in 14 is spent mostly below:
> -   94.58%     0.01%  postgres  postgres  [.] CreateFunction
>    - 94.57% CreateFunction
>       - 94.49% ProcedureCreate
>          - 90.95% record_object_address_dependencies
>             - 90.93% recordMultipleDependencies
>                - 89.65% isObjectPinned
>                   - 89.12% systable_getnext
>                      - 89.06% index_getnext_slot
>                         - 56.13% index_fetch_heap
>                            - 54.82% table_index_fetch_tuple
>                               + 53.79% heapam_index_fetch_tuple
>                                 0.07% heap_hot_search_buffer
>                           0.01% ReleaseAndReadBuffer
>                           0.01% LockBuffer
>                           0.08% heapam_index_fetch_tuple

Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.

			regards, tom lane
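A minimal sketch of what such an OID-threshold check could look like,
assuming the cutoff is FirstGenbkiObjectId (10000, i.e. everything with a
hand-assigned OID counts as pinned). The helper name and signature below
are illustrative; they are not PostgreSQL's actual isObjectPinned()
interface.

    #include "access/transam.h"         /* FirstGenbkiObjectId */
    #include "catalog/objectaddress.h"

    /*
     * Illustrative only: treat every object whose OID was hand-assigned
     * (below FirstGenbkiObjectId) as pinned, instead of probing pg_depend
     * for an explicit 'p' entry.
     */
    static bool
    object_is_pinned_by_oid(const ObjectAddress *object)
    {
        return object->objectId < FirstGenbkiObjectId;
    }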
Hi,

On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The time in 14 is spent mostly below:
> > -   94.58%     0.01%  postgres  postgres  [.] CreateFunction
> >    - 94.57% CreateFunction
> >       - 94.49% ProcedureCreate
> >          - 90.95% record_object_address_dependencies
> >             - 90.93% recordMultipleDependencies
> >                - 89.65% isObjectPinned
> >                   - 89.12% systable_getnext
> >                      - 89.06% index_getnext_slot
> >                         - 56.13% index_fetch_heap
> >                            - 54.82% table_index_fetch_tuple
> >                               + 53.79% heapam_index_fetch_tuple
> >                                 0.07% heap_hot_search_buffer
> >                           0.01% ReleaseAndReadBuffer
> >                           0.01% LockBuffer
> >                           0.08% heapam_index_fetch_tuple
>
> Not wanting to distract from your point about xactCompletionCount,
> but ... I wonder if we could get away with defining "isObjectPinned"
> as "is the OID <= 9999" (and, in consequence, dropping explicit pin
> entries from pg_depend). I had not previously seen a case where the
> cost of looking into pg_depend for this info was this much of the
> total query runtime.

I had the same thought, and yes, I do think we should do that. I've seen
it show up in non-buggy workloads too (not to that degree though).

The <= 9999 pg_depend entries are also a substantial proportion of the
size of an empty database, making it worthwhile to remove them.

Freshly initdb'd empty cluster:

VACUUM FULL pg_depend;

dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class
                    WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│               oid               │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_depend                       │           532480 │
│ pg_toast.pg_toast_2618          │           516096 │
│ pg_collation                    │           360448 │
│ pg_description                  │           352256 │
│ pg_depend_depender_index        │           294912 │
│ pg_depend_reference_index       │           294912 │
│ pg_description_o_c_o_index      │           221184 │
│ pg_statistic                    │           155648 │
│ pg_operator                     │           114688 │
│ pg_collation_name_enc_nsp_index │           106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)

DELETE FROM pg_depend WHERE deptype = 'p' AND refobjid <> 0 AND refobjid < 10000;
VACUUM FULL pg_depend;

dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class
                    WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│               oid               │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_toast.pg_toast_2618          │           516096 │
│ pg_collation                    │           360448 │
│ pg_description                  │           352256 │
│ pg_depend                       │           253952 │
│ pg_description_o_c_o_index      │           221184 │
│ pg_statistic                    │           155648 │
│ pg_depend_depender_index        │           147456 │
│ pg_depend_reference_index       │           147456 │
│ pg_operator                     │           114688 │
│ pg_collation_name_enc_nsp_index │           106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)

A reduction from 8407kB to 7863kB of the size of the "dropme" database,
just by deleting the "implicitly pinned" entries, seems like a good deal.

To save people the time to look it up: pg_toast.pg_toast_2618 is
pg_rewrite's toast table...

Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
pinned? That'd be another 400kB of database size...

Greetings,

Andres Freund
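For reference, the OID range boundaries being discussed are defined in
src/include/access/transam.h; the values below are as of the PG 14
development cycle (comments paraphrased):

    #define FirstGenbkiObjectId     10000   /* first OID assigned by genbki.pl */
    #define FirstBootstrapObjectId  12000   /* first OID assigned by initdb's post-bootstrap SQL */
    #define FirstNormalObjectId     16384   /* first OID available to user-created objects */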
Andres Freund <andres@anarazel.de> writes:
> On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
>> Not wanting to distract from your point about xactCompletionCount,
>> but ... I wonder if we could get away with defining "isObjectPinned"
>> as "is the OID <= 9999" (and, in consequence, dropping explicit pin
>> entries from pg_depend). I had not previously seen a case where the
>> cost of looking into pg_depend for this info was this much of the
>> total query runtime.

> Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
> pinned? That'd be another 400kB of database size...

Yeah, it'd require some close study of exactly what we want to pin or not
pin. Certainly everything with hand-assigned OIDs should be pinned, but I
think there's a lot of critical stuff like index opclasses that don't get
hand-assigned OIDs. On the other hand, it's intentional that nothing in
information_schema is pinned.

We might have to rejigger initdb so that there's a clearer distinction
between the OID ranges we want to pin or not. Maybe we'd even get initdb
to record the cutoff OID in pg_control or someplace.

Anyway, just idle late-night speculation for now ...

			regards, tom lane
On 2021-04-06 01:34:02 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
> >> Not wanting to distract from your point about xactCompletionCount,
> >> but ... I wonder if we could get away with defining "isObjectPinned"
> >> as "is the OID <= 9999" (and, in consequence, dropping explicit pin
> >> entries from pg_depend). I had not previously seen a case where the
> >> cost of looking into pg_depend for this info was this much of the
> >> total query runtime.
>
> > Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
> > pinned? That'd be another 400kB of database size...
>
> Yeah, it'd require some close study of exactly what we want to pin
> or not pin.

One interesting bit is:

postgres[947554][1]=# SELECT classid::regclass, objid, refclassid::regclass, refobjid, deptype, refobjversion
                      FROM pg_depend WHERE refobjid < 13000 AND deptype <> 'p';
┌───────────────┬───────┬──────────────┬──────────┬─────────┬───────────────┐
│    classid    │ objid │  refclassid  │ refobjid │ deptype │ refobjversion │
├───────────────┼───────┼──────────────┼──────────┼─────────┼───────────────┤
│ pg_constraint │ 15062 │ pg_collation │      100 │ n       │ 2.31          │
└───────────────┴───────┴──────────────┴──────────┴─────────┴───────────────┘
(1 row)

> Certainly everything with hand-assigned OIDs should
> be pinned, but I think there's a lot of critical stuff like index
> opclasses that don't get hand-assigned OIDs. On the other hand,
> it's intentional that nothing in information_schema is pinned.

Isn't that pretty much the difference between FirstGenbkiObjectId and
FirstBootstrapObjectId? Genbki will have assigned things like opclasses,
but not things like information_schema?

> We might have to rejigger initdb so that there's a clearer
> distinction between the OID ranges we want to pin or not.
> Maybe we'd even get initdb to record the cutoff OID in
> pg_control or someplace.

The only non-pinned pg_depend entry below FirstBootstrapObjectId is the
collation versioning one above. The only pinned entries above
FirstBootstrapObjectId are the ones created via system_constraints.sql. So
it seems we "just" would need to resolve the constraint versioning stuff?
And that could probably just be handled as a hardcoded special case for
now...

Greetings,

Andres Freund
Hi,

On 2021-04-05 21:35:21 -0700, Andres Freund wrote:
> See the attached fix. I did include a test that verifies that the
> kill_prior_tuples optimization actually prevents the index from growing,
> when subtransactions are involved. I think it should be stable, even
> with concurrent activity. But I'd welcome a look.

Pushed that now, after trying and failing to make the test spuriously fail
due to concurrent activity.

Greetings,

Andres Freund