Re: ResourceOwner refactoring - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: ResourceOwner refactoring
Date
Msg-id 20607b96-b275-4a57-8509-29a3ed320d3f@iki.fi
Whole thread Raw
In response to Re: ResourceOwner refactoring  (Andres Freund <andres@anarazel.de>)
Responses Re: ResourceOwner refactoring
Re: ResourceOwner refactoring
List pgsql-hackers
On 10/07/2023 22:14, Andres Freund wrote:
> On 2023-05-25 20:14:23 +0300, Heikki Linnakangas wrote:
>> @@ -4183,9 +4180,6 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
>>       if (use_bsearch)
>>           pg_qsort(srels, nrels, sizeof(SMgrSortArray), rlocator_comparator);
>>
>> -    /* Make sure we can handle the pin inside the loop */
>> -    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
>> -
>>       for (i = 0; i < NBuffers; i++)
>>       {
>>           SMgrSortArray *srelent = NULL;
> 
> Hm, a tad worried that increasing the number of ResourceOwnerEnlargeBuffers()
> that drastically could have a visible overhead.

You mean in FlushRelationsAllBuffers() in particular? Seems pretty cheap 
compared to all the other work involved, and it's not that performance 
critical anyway.

ExtendBufferedRelShared() gave me a pause, as it is in the critical 
path. But there too, the ResourceOwnerEnlarge() call pales in comparison 
to all the other work that it does for each buffer.

Other than that, I don't think I'm introducing new 
ResourceOwnerEnlarge() calls to hot codepaths, just moving them around.

>> +static ResourceOwnerFuncs tupdesc_resowner_funcs =
>> +{
>> +    .name = "tupdesc reference",
>> +    .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
>> +    .release_priority = RELEASE_PRIO_TUPDESC_REFS,
>> +    .ReleaseResource = ResOwnerReleaseTupleDesc,
>> +    .PrintLeakWarning = ResOwnerPrintTupleDescLeakWarning
>> +};
> 
> I think these should all be marked const, there's no need to have them be in a
> mutable section of the binary. Of course that will require adjusting a bunch
> of the signatures too, but that seems fine.

Done.

>> --- a/src/backend/access/transam/xact.c
>> +++ b/src/backend/access/transam/xact.c
>> @@ -5171,9 +5171,24 @@ AbortSubTransaction(void)
>>           ResourceOwnerRelease(s->curTransactionOwner,
>>                                RESOURCE_RELEASE_BEFORE_LOCKS,
>>                                false, false);
>> +
>>           AtEOSubXact_RelationCache(false, s->subTransactionId,
>>                                     s->parent->subTransactionId);
>> +
>> +
>> +        /*
>> +         * AtEOSubXact_Inval sometimes needs to temporarily
>> +         * bump the refcount on the relcache entries that it processes.
>> +         * We cannot use the subtransaction's resource owner anymore,
>> +         * because we've already started releasing it.  But we can use
>> +         * the parent resource owner.
>> +         */
>> +        CurrentResourceOwner = s->parent->curTransactionOwner;
>> +
>>           AtEOSubXact_Inval(false);
>> +
>> +        CurrentResourceOwner = s->curTransactionOwner;
>> +
>>           ResourceOwnerRelease(s->curTransactionOwner,
>>                                RESOURCE_RELEASE_LOCKS,
>>                                false, false);
> 
> I might be a bit slow this morning, but why did this need to change as part of
> this?

I tried to explain it in the comment. AtEOSubXact_Inval() sometimes 
needs to bump the refcount on a relcache entry, which is tracked in the 
current resource owner. But we have already started to release it. On 
master, you can allocate new resources in a ResourceOwner after you have 
already called ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS), but 
with this patch, that's no longer allowed and you get an assertion 
failure. I think it was always a bit questionable; it's not clear that 
the resource would've been released correctly if an error occurred. In 
practice, I don't think an error could actually occur while 
AtEOXSubXact_Inval() briefly holds the relcache entry.

>> @@ -5182,8 +5221,8 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
>>    *    If I/O was in progress, we always set BM_IO_ERROR, even though it's
>>    *    possible the error condition wasn't related to the I/O.
>>    */
>> -void
>> -AbortBufferIO(Buffer buffer)
>> +static void
>> +AbortBufferIO(Buffer buffer, bool forget_owner)
> 
> What is the forget_owner argument for? It's always false, afaics?

Removed. There used to be more callers of AbortBufferIO() which needed 
that, but not anymore.

>> +/* Convenience wrappers over ResourceOwnerRemember/Forget */
>> +#define ResourceOwnerRememberCatCacheRef(owner, tuple) \
>> +    ResourceOwnerRemember(owner, PointerGetDatum(tuple), &catcache_resowner_funcs)
>> +#define ResourceOwnerForgetCatCacheRef(owner, tuple) \
>> +    ResourceOwnerForget(owner, PointerGetDatum(tuple), &catcache_resowner_funcs)
>> +#define ResourceOwnerRememberCatCacheListRef(owner, list) \
>> +    ResourceOwnerRemember(owner, PointerGetDatum(list), &catlistref_resowner_funcs)
>> +#define ResourceOwnerForgetCatCacheListRef(owner, list) \
>> +    ResourceOwnerForget(owner, PointerGetDatum(list), &catlistref_resowner_funcs)
> 
> Not specific to this type of resource owner, but I wonder if it'd not be
> better to use inline functions for these - the PointerGetDatum() cast removes
> nearly all type safety.

Peter just said the same thing. I guess you're right, changed.

>> +Releasing
>> +---------
>> +
>> +Releasing the resources of a ResourceOwner happens in three phases:
>> +
>> +1. "Before-locks" resources
>> +
>> +2. Locks
>> +
>> +3. "After-locks" resources
> 
> Given that we now have priorities, I wonder if we shouldn't merge the phase and
> the priorities?  We have code like this:
> 
> 
>     ResourceOwnerRelease(TopTransactionResourceOwner,
>                          RESOURCE_RELEASE_BEFORE_LOCKS,
>                          true, true);
> ...
>     ResourceOwnerRelease(TopTransactionResourceOwner,
>                          RESOURCE_RELEASE_LOCKS,
>                          true, true);
>     ResourceOwnerRelease(TopTransactionResourceOwner,
>                          RESOURCE_RELEASE_AFTER_LOCKS,
>                          true, true);
> 
> Now, admittedly locks currently are "special" anyway, but we have a plan to
> get rid of that.  If we did, then we could do the last two as one as
> ResourceOwnerRelease(from = RELEASE_PRIO_LOCKS, to: RELEASE_PRIO_LAST, ...)
> or such.

Currently, we always release parent's resources before a child's, within 
each phase. I'm not sure if we rely on that parent-child ordering 
anywhere though. I'd like to leave it as it is for now, to limit the 
scope of this, and maybe revisit later.

>> +Normally, you are expected to call ResourceOwnerForget on every resource so
>> +that at commit, the ResourceOwner is empty (locks are an exception). If there
>> +are any resources still held at commit, ResourceOwnerRelease will call the
>> +PrintLeakWarning callback on each such resource. At abort, however, we truly
>> +rely on the ResourceOwner mechanism and it is normal that there are resources
>> +to be released.
> 
> I am not sure it's a good idea to encode that all kinds of resources ought to
> have been released prior on commit. I can see that not always making sense -
> in fact we already don't do it for locks, no?  Perhaps just add a "commonly"?

Hmm, I'd still like to print the warnings for resources that we expect 
to be released before commit, though. We could have a flag in the 
ResourceOwnerDesc to give flexibility, but I'm inclined to wait until we 
get the first case of someone actually needing it.

>> +typedef struct ResourceElem
>>   {
>> ...
>> +    Datum        item;
>> +    ResourceOwnerFuncs *kind;    /* NULL indicates a free hash table slot */
>> +} ResourceElem;
> 
>>   /*
>> - * Initially allocated size of a ResourceArray.  Must be power of two since
>> - * we'll use (arraysize - 1) as mask for hashing.
>> + * Size of the fixed-size array to hold most-recently remembered resources.
>>    */
>> -#define RESARRAY_INIT_SIZE 16
>> +#define RESOWNER_ARRAY_SIZE 32
> 
> That's 512 bytes, pretty large to be searched sequentially. It's somewhat sad
> that the array needs to include 8 byte of ResourceOwnerFuncs...

Yeah. Early on I considered having a RegisterResourceKind() function 
that you need to call first, and then each resource kind could be 
assigned a smaller ID. If we wanted to keep the old mechanism of 
separate arrays for each different resource kind, that'd be the way to 
go. But overall this seems simpler, and the performance is decent 
despite that linear scan.

Note that the current code in master also uses a plain array, until it 
grows to 512 bytes, when it switches to a hash table. Lot of the details
with the array/hash combo are different, but that threshold was the same.

>> +    bool        releasing;        /* has ResourceOwnerRelease been called? */
>> +
>> +    /*
>> +     * The fixed-size array for recent resources.
>> +     *
>> +     * If 'releasing' is set, the contents are sorted by release priority.
>> +     */
>> +    uint32        narr;            /* how many items are stored in the array */
>> +    ResourceElem arr[RESOWNER_ARRAY_SIZE];
>> +
>> +    /*
>> +     * The hash table.  Uses open-addressing.  'nhash' is the number of items
>> +     * present; if it would exceed 'grow_at', we enlarge it and re-hash.
>> +     * 'grow_at' should be rather less than 'capacity' so that we don't waste
>> +     * too much time searching for empty slots.
>> +     *
>> +     * If 'releasing' is set, the contents are no longer hashed, but sorted by
>> +     * release priority.  The first 'nhash' elements are occupied, the rest
>> +     * are empty.
>> +     */
>> +    ResourceElem *hash;
>> +    uint32        nhash;            /* how many items are stored in the hash */
>> +    uint32        capacity;        /* allocated length of hash[] */
>> +    uint32        grow_at;        /* grow hash when reach this */
>> +
>> +    /* The local locks cache. */
>> +    uint32        nlocks;            /* number of owned locks */
>>       LOCALLOCK  *locks[MAX_RESOWNER_LOCKS];    /* list of owned locks */
> 
> Due to 'narr' being before the large 'arr', 'nhash' is always on a separate
> cacheline. I.e. the number of cachelines accessed in happy paths is higher
> than necessary, also relevant because there's more dependencies on narr, nhash
> than on the contents of those.
> 
> I'd probably order it so that both of the large elements (arr, locks) are at
> the end.
> 
> It's hard to know with changes this small, but it does seem to yield a small
> and repeatable performance benefit (pipelined pgbench -S).

Swapped the 'hash' and 'arr' parts of the struct, so that 'arr' and 
'locks' are at the end.

>> +/*
>> + * Forget that an object is owned by a ResourceOwner
>> + *
>> + * Note: if same resource ID is associated with the ResourceOwner more than
>> + * once, one instance is removed.
>> + */
>> +void
>> +ResourceOwnerForget(ResourceOwner owner, Datum value, ResourceOwnerFuncs *kind)
>> +{
>> +#ifdef RESOWNER_TRACE
>> +    elog(LOG, "FORGET %d: owner %p value " UINT64_FORMAT ", kind: %s",
>> +         resowner_trace_counter++, owner, DatumGetUInt64(value), kind->name);
>> +#endif
>> +
>> +    /*
>> +     * Mustn't call this after we have already started releasing resources.
>> +     * (Release callback functions are not allowed to release additional
>> +     * resources.)
>> +     */
>> +    if (owner->releasing)
>> +        elog(ERROR, "ResourceOwnerForget called for %s after release started", kind->name);
>> +
>> +    /* Search through all items in the array first. */
>> +    for (uint32 i = 0; i < owner->narr; i++)
>> +    {
> 
> Hm, I think we oftend up releasing resources in a lifo order. Would it
> possibly make sense to search in reverse order?

Changed. I can see a small speedup in a micro benchmark, when the array 
is almost full and you repeatedly remember / forget one more resource.

> Separately, I wonder if it'd be worth to check if there are any other matches
> when assertions are enabled.

It is actually allowed to remember the same resource twice. There's a 
comment in ResourceOwnerForget (previously in ResourceArrayRemove) that 
each call releases one instance in that case. I'm not sure if we rely on 
that anywhere currently.

>>   ResourceOwnerRelease(ResourceOwner owner,
>> @@ -492,6 +631,15 @@ ResourceOwnerRelease(ResourceOwner owner,
>>   {
>>       /* There's not currently any setup needed before recursing */
>>       ResourceOwnerReleaseInternal(owner, phase, isCommit, isTopLevel);
>> +
>> +#ifdef RESOWNER_STATS
>> +    if (isTopLevel)
>> +    {
>> +        elog(LOG, "RESOWNER STATS: lookups: array %d, hash %d", narray_lookups, nhash_lookups);
>> +        narray_lookups = 0;
>> +        nhash_lookups = 0;
>> +    }
>> +#endif
>>   }
> 
> Why do we have ResourceOwnerRelease() vs ResourceOwnerReleaseInternal()? Looks
> like that's older than your patch, but still confused.

Dunno. It provides a place to do things on the top-level resource owner 
that you don't want to do when you recurse to the children, as hinted by 
the comment, but I don't know if we've ever had such things. It was 
handy for printing the RESOWNER_STATS now, though :-).

>> +    /* Let add-on modules get a chance too */
>> +    for (item = ResourceRelease_callbacks; item; item = next)
>> +    {
>> +        /* allow callbacks to unregister themselves when called */
>> +        next = item->next;
>> +        item->callback(phase, isCommit, isTopLevel, item->arg);
>> +    }
> 
> I wonder if it's really a good idea to continue having this API. What you are
> allowed to do in a resowner changed

Hmm, I don't think anything changed for users of this API. I'm not in a 
hurry to remove this and force extensions to adapt, as it's not hard to 
maintain this.

>> +/*
>> + * ResourceOwnerReleaseAllOfKind
>> + *        Release all resources of a certain type held by this owner.
>> + */
>> +void
>> +ResourceOwnerReleaseAllOfKind(ResourceOwner owner, ResourceOwnerFuncs *kind)
>> +{
>> +    /* Mustn't call this after we have already started releasing resources. */
>> +    if (owner->releasing)
>> +        elog(ERROR, "ResourceOwnerForget called for %s after release started", kind->name);
>>
>> -        /* Ditto for plancache references */
>> -        while (ResourceArrayGetAny(&(owner->planrefarr), &foundres))
>> +    /*
>> +     * Temporarily set 'releasing', to prevent calls to ResourceOwnerRemember
>> +     * while we're scanning the owner.  Enlarging the hash would cause us to
>> +     * lose track of the point we're scanning.
>> +     */
>> +    owner->releasing = true;
> 
> Is it a problem than a possible error would lead to ->releasing = true to be
> "leaked"?  I think it might be, because we haven't called ResourceOwnerSort(),
> which means we'd not process resources in the right order during abort
> processing.

Good point. I added a separate 'sorted' flag to handle that gracefully.

>> +/*
>> + * Hash function for value+kind combination.
>> + */
>>   static inline uint32
>>   hash_resource_elem(Datum value, ResourceOwnerFuncs *kind)
>>   {
>> -    Datum        data[2];
>> -
>> -    data[0] = value;
>> -    data[1] = PointerGetDatum(kind);
>> -
>> -    return hash_bytes((unsigned char *) &data, 2 * SIZEOF_DATUM);
>> +    /*
>> +     * Most resource kinds store a pointer in 'value', and pointers are unique
>> +     * all on their own.  But some resources store plain integers (Files and
>> +     * Buffers as of this writing), so we want to incorporate the 'kind' in
>> +     * the hash too, otherwise those resources will collide a lot.  But
>> +     * because there are only a few resource kinds like that - and only a few
>> +     * resource kinds to begin with - we don't need to work too hard to mix
>> +     * 'kind' into the hash.  Just add it with hash_combine(), it perturbs the
>> +     * result enough for our purposes.
>> +     */
>> +#if SIZEOF_DATUM == 8
>> +    return hash_combine64(murmurhash64((uint64) value), (uint64) kind);
>> +#else
>> +    return hash_combine(murmurhash32((uint32) value), (uint32) kind);
>> +#endif
>>   }
> 
> Why are we using a 64bit hash to then just throw out the high bits?

It was just expedient when the inputs are 64-bit. Implementation of 
murmurhash64 is tiny, and we already use murmurhash32 elsewhere, so it 
seemed like an OK choice. I'm sure there are better algorithms out 
there, though.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Attachment

pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Next
From: Aleksander Alekseev
Date:
Subject: Re: RFC: Pluggable TOAST