subtransaction performance regression [kind of] due to snapshot caching - Mailing list pgsql-hackers

From Andres Freund
Subject subtransaction performance regression [kind of] due to snapshot caching
Date
Msg-id 20210406043521.lopeo7bbigad3n6t@alap3.anarazel.de
Whole thread Raw
Responses Re: subtransaction performance regression [kind of] due to snapshot caching
Re: subtransaction performance regression [kind of] due to snapshot caching
List pgsql-hackers
Hi,

In a recent thread ([1]) I found a performance regression of the
following statement
DO $do$
    BEGIN FOR i IN 1 .. 10000 LOOP
        BEGIN
            EXECUTE $cf$CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $f$BEGIN frakbar;
END;$f$;$cf$;
        EXCEPTION WHEN others THEN
        END;
   END LOOP;
END;$do$;

13: 1617.798
14-dev: 34088.505

The time in 14 is spent mostly below:
-   94.58%     0.01%  postgres  postgres            [.] CreateFunction
   - 94.57% CreateFunction
      - 94.49% ProcedureCreate
         - 90.95% record_object_address_dependencies
            - 90.93% recordMultipleDependencies
               - 89.65% isObjectPinned
                  - 89.12% systable_getnext
                     - 89.06% index_getnext_slot
                        - 56.13% index_fetch_heap
                           - 54.82% table_index_fetch_tuple
                              + 53.79% heapam_index_fetch_tuple
                                0.07% heap_hot_search_buffer
                                0.01% ReleaseAndReadBuffer
                                0.01% LockBuffer
                             0.08% heapam_index_fetch_tuple


After a bit of debugging I figured out that the direct failure lies with
623a9ba79b. The problem is that subtransaction abort does not increment
ShmemVariableCache->xactCompletionCount. That's trivial to remedy (we
already lock ProcArrayLock during XidCacheRemoveRunningXids).

What happens is that heap_hot_search_buffer()-> correctly recognizes the
aborted subtransaction's rows as dead, but they are not recognized as
"surely dead". Which then leads to O(iterations^2) index->heap lookups,
because the rows from previous iterations are never recognized as dead.

I initially was a bit worried that this could be a correctness issue as
well. The more I thought about it the more confused I got. A
transaction's subtransaction abort should not actually change the
contents of a snapshot, right?

Snapshot
GetSnapshotData(Snapshot snapshot)
...
            /*
             * We don't include our own XIDs (if any) in the snapshot. It
             * needs to be includeded in the xmin computation, but we did so
             * outside the loop.
             */
            if (pgxactoff == mypgxactoff)
                continue;

The sole reason for the behavioural difference is that the cached
snapshot's xmax is *lower* than a new snapshot's would be, because
GetSnapshotData() initializes xmax as
ShmemVariableCache->latestCompletedXid - which
XidCacheRemoveRunningXids() incremented, without incrementing
ShmemVariableCache->xactCompletionCount.

Which then causes HeapTupleSatisfiesMVCC to go down
    if (!HeapTupleHeaderXminCommitted(tuple))
...
        else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
            return false;
        else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
            SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                        HeapTupleHeaderGetRawXmin(tuple));
        else
        {
            /* it must have aborted or crashed */
            SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                        InvalidTransactionId);
            return false;
        }

the "return false" for XidInMVCCSnapshot() rather than the
SetHintBits(HEAP_XMIN_INVALID) path. Which then in turn causes
HeapTupleIsSurelyDead() to not recognize the rows as surely dead.

bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
    uint32      i;

    /*
     * Make a quick range check to eliminate most XIDs without looking at the
     * xip arrays.  Note that this is OK even if we convert a subxact XID to
     * its parent below, because a subxact with XID < xmin has surely also got
     * a parent with XID < xmin, while one with XID >= xmax must belong to a
     * parent that was not yet committed at the time of this snapshot.
     */

    /* Any xid < xmin is not in-progress */
    if (TransactionIdPrecedes(xid, snapshot->xmin))
        return false;
    /* Any xid >= xmax is in-progress */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return true;


I *think* this issue doesn't lead to actually wrong query results. For
HeapTupleSatisfiesMVCC purposes there's not much of a difference between
an aborted transaction and one that's "in progress" according to the
snapshot (that's required - we don't check for aborts for xids in the
snapshot).

It is a bit disappointing that there - as far as I could find - are no
tests for kill_prior_tuple actually working. I guess that lack, and that
there's no difference in query results, explains why nobody noticed the
issue in the last ~9 months.

See the attached fix. I did include a test that verifies that the
kill_prior_tuples optimization actually prevents the index from growing,
when subtransactions are involved. I think it should be stable, even
with concurrent activity. But I'd welcome a look.


I don't think that's why the issue exists, but I very much hate the
XidCache* name. Makes it sound much less important than it is...


Greetings,

Andres Freund

[1] https://postgr.es/m/20210317055718.v6qs3ltzrformqoa%40alap3.anarazel.de

Attachment

pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: TRUNCATE on foreign table
Next
From: Fujii Masao
Date:
Subject: Re: Get memory contexts of an arbitrary backend process