Thread: subtransaction performance regression [kind of] due to snapshot caching

From: Andres Freund
Hi,

In a recent thread ([1]) I found a performance regression of the
following statement
DO $do$
    BEGIN FOR i IN 1 .. 10000 LOOP
        BEGIN
            EXECUTE $cf$CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $f$BEGIN frakbar;
END;$f$;$cf$;
        EXCEPTION WHEN others THEN
        END;
   END LOOP;
END;$do$;

13: 1617.798
14-dev: 34088.505

The time in 14 is spent mostly below:
-   94.58%     0.01%  postgres  postgres            [.] CreateFunction
   - 94.57% CreateFunction
      - 94.49% ProcedureCreate
         - 90.95% record_object_address_dependencies
            - 90.93% recordMultipleDependencies
               - 89.65% isObjectPinned
                  - 89.12% systable_getnext
                     - 89.06% index_getnext_slot
                        - 56.13% index_fetch_heap
                           - 54.82% table_index_fetch_tuple
                              + 53.79% heapam_index_fetch_tuple
                                0.07% heap_hot_search_buffer
                                0.01% ReleaseAndReadBuffer
                                0.01% LockBuffer
                             0.08% heapam_index_fetch_tuple


After a bit of debugging I figured out that the direct cause is commit
623a9ba79b. The problem is that subtransaction abort does not increment
ShmemVariableCache->xactCompletionCount. That's trivial to remedy (we
already take ProcArrayLock during XidCacheRemoveRunningXids).
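The invariant at issue can be sketched with a toy model (plain C, not PostgreSQL code; the Toy* names are made up): reuse of the cached snapshot is gated on xactCompletionCount, so an abort path that advances latestCompletedXid without bumping the counter leaves a stale, too-low xmax in the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy shared state: mirrors the two ShmemVariableCache fields discussed. */
typedef struct
{
    uint64_t    xactCompletionCount;
    uint32_t    latestCompletedXid;
} ToyShmem;

typedef struct
{
    bool        valid;
    uint64_t    cached_completion_count;
    uint32_t    xmax;
} ToySnapshotCache;

static uint32_t
toy_get_snapshot_xmax(ToyShmem *shmem, ToySnapshotCache *cache)
{
    /* reuse the cached snapshot if no transaction completed since it was built */
    if (cache->valid &&
        cache->cached_completion_count == shmem->xactCompletionCount)
        return cache->xmax;

    /* build a "fresh" snapshot: xmax is one past latestCompletedXid */
    cache->xmax = shmem->latestCompletedXid + 1;
    cache->cached_completion_count = shmem->xactCompletionCount;
    cache->valid = true;
    return cache->xmax;
}

/* subxact abort before the fix: advances the xid horizon only */
static void
toy_abort_subxact_buggy(ToyShmem *shmem, uint32_t xid)
{
    shmem->latestCompletedXid = xid;
}

/* subxact abort with the fix: also invalidates cached snapshots */
static void
toy_abort_subxact_fixed(ToyShmem *shmem, uint32_t xid)
{
    shmem->latestCompletedXid = xid;
    shmem->xactCompletionCount++;
}
```

After the buggy abort the cache still validates, handing back the old xmax; after the fixed abort the next snapshot is rebuilt.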

What happens is that heap_hot_search_buffer() correctly recognizes the
aborted subtransaction's rows as dead, but they are not recognized as
"surely dead". That leads to O(iterations^2) index->heap lookups,
because the rows from previous iterations are never recognized as dead.
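A toy model (plain C, not PostgreSQL code; names are made up) of how the unflagged dead rows turn into quadratic work:

```c
#include <assert.h>
#include <stdbool.h>

/* Each loop iteration inserts one index entry pointing at a dead heap
 * tuple, then probes the index.  If dead entries are never flagged
 * LP_DEAD (kill_prior_tuple never applies), every probe re-fetches all
 * prior dead tuples, i.e. O(iterations^2) heap fetches in total; with
 * flagging, each dead tuple is fetched only once. */
static long
toy_heap_fetches(int iterations, bool kill_prior_tuple)
{
    long        fetches = 0;
    long        unflagged_dead = 0; /* dead tuples still reachable via index */
    int         i;

    for (i = 0; i < iterations; i++)
    {
        unflagged_dead++;           /* this iteration's aborted row */
        fetches += unflagged_dead;  /* probe visits every unflagged entry */
        if (kill_prior_tuple)
            unflagged_dead = 0;     /* entries flagged, skipped next time */
    }
    return fetches;
}
```

For 10000 iterations that is 50005000 fetches without flagging versus 10000 with it, matching the magnitude of the slowdown above.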

I initially was a bit worried that this could be a correctness issue as
well. The more I thought about it the more confused I got. A
transaction's subtransaction abort should not actually change the
contents of a snapshot, right?

Snapshot
GetSnapshotData(Snapshot snapshot)
...
            /*
             * We don't include our own XIDs (if any) in the snapshot. It
             * needs to be included in the xmin computation, but we did so
             * outside the loop.
             */
            if (pgxactoff == mypgxactoff)
                continue;

The sole reason for the behavioural difference is that the cached
snapshot's xmax is *lower* than a new snapshot's would be, because
GetSnapshotData() initializes xmax as
ShmemVariableCache->latestCompletedXid - which
XidCacheRemoveRunningXids() incremented, without incrementing
ShmemVariableCache->xactCompletionCount.

Which then causes HeapTupleSatisfiesMVCC to go down
    if (!HeapTupleHeaderXminCommitted(tuple))
...
        else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
            return false;
        else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
            SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                        HeapTupleHeaderGetRawXmin(tuple));
        else
        {
            /* it must have aborted or crashed */
            SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                        InvalidTransactionId);
            return false;
        }

the "return false" for XidInMVCCSnapshot() rather than the
SetHintBits(HEAP_XMIN_INVALID) path. Which then in turn causes
HeapTupleIsSurelyDead() to not recognize the rows as surely dead.

bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
    uint32      i;

    /*
     * Make a quick range check to eliminate most XIDs without looking at the
     * xip arrays.  Note that this is OK even if we convert a subxact XID to
     * its parent below, because a subxact with XID < xmin has surely also got
     * a parent with XID < xmin, while one with XID >= xmax must belong to a
     * parent that was not yet committed at the time of this snapshot.
     */

    /* Any xid < xmin is not in-progress */
    if (TransactionIdPrecedes(xid, snapshot->xmin))
        return false;
    /* Any xid >= xmax is in-progress */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return true;


I *think* this issue doesn't lead to actually wrong query results. For
HeapTupleSatisfiesMVCC purposes there's not much of a difference between
an aborted transaction and one that's "in progress" according to the
snapshot (that's required - we don't check for aborts for xids in the
snapshot).
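That equivalence can be illustrated with a toy model (plain C, not PostgreSQL code; the toy_* names are made up): whether the aborted xid is still covered by a stale snapshot's xmax or already past a fresh one, the tuple comes out invisible either way; only the hint-bit side effect differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t TransactionId;

/* Toy snapshot: xids >= xmax, or listed in xip[], count as in progress;
 * everything below xmin has surely completed. */
typedef struct
{
    TransactionId xmin;
    TransactionId xmax;
    const TransactionId *xip;
    int         xcnt;
} ToySnapshot;

static bool
toy_xid_in_snapshot(TransactionId xid, const ToySnapshot *snap)
{
    int         i;

    if (xid < snap->xmin)
        return false;           /* surely completed */
    if (xid >= snap->xmax)
        return true;            /* treated as in progress */
    for (i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}

/* Visibility of a tuple whose inserting xact aborted: if the snapshot
 * covers the xid we take the "in progress" exit, otherwise the "it must
 * have aborted" exit -- invisible in both cases. */
static bool
toy_xmin_visible(TransactionId tuple_xmin, const ToySnapshot *snap,
                 bool xact_committed)
{
    if (toy_xid_in_snapshot(tuple_xmin, snap))
        return false;           /* "in progress" per snapshot */
    return xact_committed;      /* aborted => invisible here too */
}
```

With a stale snapshot (xmax too low) and a fresh one, an aborted xid is invisible under both, which is why no wrong query results are expected.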

It is a bit disappointing that there are - as far as I could find - no
tests for kill_prior_tuple actually working. I guess that lack, plus the
fact that query results don't differ, explains why nobody noticed the
issue in the last ~9 months.

See the attached fix. I did include a test that verifies that the
kill_prior_tuple optimization actually prevents the index from growing,
when subtransactions are involved. I think it should be stable, even
with concurrent activity. But I'd welcome a look.


I don't think that's why the issue exists, but I very much hate the
XidCache* name. Makes it sound much less important than it is...


Greetings,

Andres Freund

[1] https://postgr.es/m/20210317055718.v6qs3ltzrformqoa%40alap3.anarazel.de

Andres Freund <andres@anarazel.de> writes:
> The time in 14 is spent mostly below:
> -   94.58%     0.01%  postgres  postgres            [.] CreateFunction
>    - 94.57% CreateFunction
>       - 94.49% ProcedureCreate
>          - 90.95% record_object_address_dependencies
>             - 90.93% recordMultipleDependencies
>                - 89.65% isObjectPinned
>                   - 89.12% systable_getnext
>                      - 89.06% index_getnext_slot
>                         - 56.13% index_fetch_heap
>                            - 54.82% table_index_fetch_tuple
>                               + 53.79% heapam_index_fetch_tuple
>                                 0.07% heap_hot_search_buffer
>                                 0.01% ReleaseAndReadBuffer
>                                 0.01% LockBuffer
>                              0.08% heapam_index_fetch_tuple

Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend).  I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.

            regards, tom lane



Re: subtransaction performance regression [kind of] due to snapshot caching

From: Andres Freund
Hi,

On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The time in 14 is spent mostly below:
> > -   94.58%     0.01%  postgres  postgres            [.] CreateFunction
> >    - 94.57% CreateFunction
> >       - 94.49% ProcedureCreate
> >          - 90.95% record_object_address_dependencies
> >             - 90.93% recordMultipleDependencies
> >                - 89.65% isObjectPinned
> >                   - 89.12% systable_getnext
> >                      - 89.06% index_getnext_slot
> >                         - 56.13% index_fetch_heap
> >                            - 54.82% table_index_fetch_tuple
> >                               + 53.79% heapam_index_fetch_tuple
> >                                 0.07% heap_hot_search_buffer
> >                                 0.01% ReleaseAndReadBuffer
> >                                 0.01% LockBuffer
> >                              0.08% heapam_index_fetch_tuple
> 
> Not wanting to distract from your point about xactCompletionCount,
> but ... I wonder if we could get away with defining "isObjectPinned"
> as "is the OID <= 9999" (and, in consequence, dropping explicit pin
> entries from pg_depend).  I had not previously seen a case where the
> cost of looking into pg_depend for this info was this much of the
> total query runtime.

I had the same thought, and yes, I do think we should do that. I've seen
it show up in non-buggy workloads too (not to that degree though).

The <= 9999 pg_depend entries are also a substantial proportion of the
size of an empty database, making them worthwhile to remove:

freshly initdb'd empty cluster:

VACUUM FULL pg_depend;
dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│               oid               │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_depend                       │           532480 │
│ pg_toast.pg_toast_2618          │           516096 │
│ pg_collation                    │           360448 │
│ pg_description                  │           352256 │
│ pg_depend_depender_index        │           294912 │
│ pg_depend_reference_index       │           294912 │
│ pg_description_o_c_o_index      │           221184 │
│ pg_statistic                    │           155648 │
│ pg_operator                     │           114688 │
│ pg_collation_name_enc_nsp_index │           106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)

DELETE FROM pg_depend WHERE deptype = 'p' AND refobjid <> 0 AND refobjid < 10000;
VACUUM FULL pg_depend;

dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│               oid               │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_toast.pg_toast_2618          │           516096 │
│ pg_collation                    │           360448 │
│ pg_description                  │           352256 │
│ pg_depend                       │           253952 │
│ pg_description_o_c_o_index      │           221184 │
│ pg_statistic                    │           155648 │
│ pg_depend_depender_index        │           147456 │
│ pg_depend_reference_index       │           147456 │
│ pg_operator                     │           114688 │
│ pg_collation_name_enc_nsp_index │           106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)

A reduction from 8407kB to 7863kB of the size of the "dropme" database
just by deleting the "implicitly pinned" entries seems like a good deal.

To save people the time to look it up: pg_toast.pg_toast_2618 is
pg_description...


Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
pinned? That'd save another ~400kB of database size...


Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
>> Not wanting to distract from your point about xactCompletionCount,
>> but ... I wonder if we could get away with defining "isObjectPinned"
>> as "is the OID <= 9999" (and, in consequence, dropping explicit pin
>> entries from pg_depend).  I had not previously seen a case where the
>> cost of looking into pg_depend for this info was this much of the
>> total query runtime.

> Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
> pinned? That'd be another 400kB of database size...

Yeah, it'd require some close study of exactly what we want to pin
or not pin.  Certainly everything with hand-assigned OIDs should
be pinned, but I think there's a lot of critical stuff like index
opclasses that don't get hand-assigned OIDs.  On the other hand,
it's intentional that nothing in information_schema is pinned.

We might have to rejigger initdb so that there's a clearer
distinction between the OID ranges we want to pin or not.
Maybe we'd even get initdb to record the cutoff OID in
pg_control or someplace.

Anyway, just idle late-night speculation for now ...

            regards, tom lane



Re: subtransaction performance regression [kind of] due to snapshot caching

From: Andres Freund
On 2021-04-06 01:34:02 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
> >> Not wanting to distract from your point about xactCompletionCount,
> >> but ... I wonder if we could get away with defining "isObjectPinned"
> >> as "is the OID <= 9999" (and, in consequence, dropping explicit pin
> >> entries from pg_depend).  I had not previously seen a case where the
> >> cost of looking into pg_depend for this info was this much of the
> >> total query runtime.
> 
> > Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
> > pinned? That'd be another 400kB of database size...
> 
> Yeah, it'd require some close study of exactly what we want to pin
> or not pin.

One interesting bit is:

postgres[947554][1]=# SELECT classid::regclass, objid, refclassid::regclass, refobjid, deptype, refobjversion FROM pg_depend WHERE refobjid < 13000 AND deptype <> 'p';
┌───────────────┬───────┬──────────────┬──────────┬─────────┬───────────────┐
│    classid    │ objid │  refclassid  │ refobjid │ deptype │ refobjversion │
├───────────────┼───────┼──────────────┼──────────┼─────────┼───────────────┤
│ pg_constraint │ 15062 │ pg_collation │      100 │ n       │ 2.31          │
└───────────────┴───────┴──────────────┴──────────┴─────────┴───────────────┘
(1 row)



> Certainly everything with hand-assigned OIDs should
> be pinned, but I think there's a lot of critical stuff like index
> opclasses that don't get hand-assigned OIDs.  On the other hand,
> it's intentional that nothing in information_schema is pinned.

Isn't that pretty much the difference between FirstGenbkiObjectId and
FirstBootstrapObjectId? Genbki will have assigned things like opclasses,
but not things like information_schema?


> We might have to rejigger initdb so that there's a clearer
> distinction between the OID ranges we want to pin or not.
> Maybe we'd even get initdb to record the cutoff OID in
> pg_control or someplace.

The only non-pinned pg_depend entry below FirstBootstrapObjectId is the
collation versioning one above. The only pinned entries above
FirstBootstrapObjectId are the ones created via
system_constraints.sql. So it seems we "just" would need to resolve the
constraint versioning stuff? And that could probably just be handled as
a hardcoded special case for now...
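For illustration, the OID ranges being discussed, with a hypothetical sketch of the pin check (the range constants match src/include/access/transam.h; the rule itself is only the speculation in this thread, not committed behaviour):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;

/* OID range boundaries, as in src/include/access/transam.h */
#define FirstGenbkiObjectId     10000   /* genbki.pl-assigned (opclasses etc.) */
#define FirstBootstrapObjectId  12000   /* initdb-assigned (information_schema etc.) */
#define FirstNormalObjectId     16384   /* user objects start here */

/* Hypothetical sketch of the proposal: treat everything below
 * FirstBootstrapObjectId as pinned instead of consulting pg_depend.
 * The collation-versioning dependency found above would still need a
 * special case. */
static bool
is_object_pinned_sketch(Oid oid)
{
    return oid != 0 && oid < FirstBootstrapObjectId;
}
```

Hand-assigned and genbki-assigned OIDs would come out pinned; initdb-time objects like information_schema, and everything user-created, would not.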

Greetings,

Andres Freund



Re: subtransaction performance regression [kind of] due to snapshot caching

From: Andres Freund
Hi,

On 2021-04-05 21:35:21 -0700, Andres Freund wrote:
> See the attached fix. I did include a test that verifies that the
> kill_prior_tuple optimization actually prevents the index from growing,
> when subtransactions are involved. I think it should be stable, even
> with concurrent activity. But I'd welcome a look.

Pushed that now, after trying and failing to make the test spuriously
fail due to concurrent activity.

Greetings,

Andres Freund