Thread: Re: Strange assertion in procarray.c

Re: Strange assertion in procarray.c

From
Michail Nikolaev
Date:
Hello, again.

> Another backend attempts to release the same resource (e.g., by aborting a transaction) and triggers the injection point.
Oh, all that GPT-like correctors required to be carefully checked :)

Correct version: Another backend attempts to release some resource (e.g., by aborting a transaction) and triggers the injection point.

Best regards,
Mikhail.

Re: Strange assertion in procarray.c

From
Michail Nikolaev
Date:
Hello, Michael!

> I fail to see how this is related?  The original issue was that this
> was impossible to run safely concurrently, but now we have the
> facilities able to do so.  There are a few cases where using a wait
> point has limits, for example outside a transaction context for some
> of the processes, but that has not really been an issue up to now.

I encountered this issue while trying to stabilize tests for [0].

Tests were crashing during the installcheck with assertion of that thread.
I have spent time trying to identify the root cause - injection points and
the way it may affect another tests even if them set to be executed locally.

In the spec, the backend performs the following:

    SELECT injection_points_set_local();
    SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');

That's all - we don't even execute any command that would trigger the wait condition.

Meanwhile, three different backends are attempting to test SERIALIZABLE isolation without any injection points.
Initially, this was a separate 'two-ids' test executed in parallel during installcheck,
but I incorporated it into the reproducer spec for simplicity.

> Isn't that pointing to an actual bug with serializable transactions?

No, let me explain.

> What you are telling here is that there is a race condition where it
> is possible to trigger an assertion failure when finishing a session
> while another one is waiting on an invalidation, if there's in the mix
> a read/write dependency error.

Actually, no backend is waiting for invalidation in this case.

Here's the sequence of events:

* The s4 backend creates a local injection point but performs no further actions. The injection point is marked as local for that pid.
* Three other backends proceed with their serializable snapshot operations.
* s2 determines it cannot commit and correctly decides to abort the transaction.
* s2 begins releasing resources:

        ResourceOwnerReleaseInternal resowner.c:694 <--- NOTE: After starting the release process, by calling this function, no new
        ResourceOwnerRelease resowner.c:654                    resources can be remembered in the resource owner.
        AbortTransaction xact.c:2960
        AbortCurrentTransactionInternal xact.c:3531
        AbortCurrentTransaction xact.c:3449
        PostgresMain postgres.c:4513
        BackendMain backend_startup.c:107
        postmaster_child_launch launch_backend.c:274
        BackendStartup postmaster.c:3377
        ServerLoop postmaster.c:1663
        PostmasterMain postmaster.c:1361
        main main.c:196

* During transaction abort, s2 invalidates its catalog snapshot with this stack trace:

        InvalidateCatalogSnapshot snapmgr.c:430
        AtEOXact_Snapshot snapmgr.c:1050
        CleanupTransaction xact.c:3016
        AbortCurrentTransactionInternal xact.c:3532
        AbortCurrentTransaction xact.c:3449
        PostgresMain postgres.c:4513
        BackendMain backend_startup.c:107
        postmaster_child_launch launch_backend.c:274
        BackendStartup postmaster.c:3377
        ServerLoop postmaster.c:1663
        PostmasterMain postmaster.c:1361
        main main.c:196

* Consequently, s2 encounters INJECTION_POINT("invalidate_catalog_snapshot_end");
* Although invalidate_catalog_snapshot_end is set to 'wait' only for s4, s2 enters the injection_wait handler
* Since this is s2's first interaction with injection points (as injection_points isn't in shared_preload_libraries), it calls injection_init_shmem
* Here, GetNamedDSMSegment is called - this is new infrastructure for initializing shared memory for extensions without shared_preload_libraries, committed by Nathan [1]
* GetNamedDSMSegment attempts to attach to memory and triggers the assertion:

        if (owner->releasing)
            elog(ERROR, "ResourceOwnerEnlarge called after release started");

        ResourceOwnerEnlarge resowner.c:449
        dsm_create_descriptor dsm.c:1206
        dsm_attach dsm.c:696
        dsa_attach dsa.c:519
        init_dsm_registry dsm_registry.c:115
        GetNamedDSMSegment dsm_registry.c:156
        injection_init_shmem injection_points.c:185
        injection_wait injection_points.c:277
        InjectionPointRun injection_point.c:551
        InvalidateCatalogSnapshot snapmgr.c:430
        AtEOXact_Snapshot snapmgr.c:1050
        CleanupTransaction xact.c:3016
        AbortCurrentTransactionInternal xact.c:3532
        AbortCurrentTransaction xact.c:3449
        PostgresMain postgres.c:4513
        BackendMain backend_startup.c:107
        postmaster_child_launch launch_backend.c:274
        BackendStartup postmaster.c:3377
        ServerLoop postmaster.c:1663
        PostmasterMain postmaster.c:1361
        main main.c:196

* This assertion during transaction abort triggers another abort call (this could be improved):

        ProcArrayEndTransaction procarray.c:677
        AbortTransaction xact.c:2946
        AbortCurrentTransactionInternal xact.c:3531
        AbortCurrentTransaction xact.c:3449
        PostgresMain postgres.c:4513 <--------------- exception handler here
        BackendMain backend_startup.c:107
        postmaster_child_launch launch_backend.c:274
        BackendStartup postmaster.c:3377
        ServerLoop postmaster.c:1663
        PostmasterMain postmaster.c:1361
        main main.c:196

This isn't the same abort attempt - it's the second one, which triggers Assert(TransactionIdIsValid(proc->xid));

In summary:

Each code component functions correctly in isolation
However, when an injection point registered as local by one backend causes another backend to register resources (and potentially other operations),
it can lead to difficult-to-diagnose issues

I see several potential solutions:

* Add injection_points to shared_preload_libraries for all tests
* Implement a mechanism to call prev_shmem_startup_hook for libraries outside shared_preload_libraries
* Modify GetNamedDSMSegment's behavior
* Run all injection_points tests in an isolated environment

In my opinion, the last option seems most appropriate.

Best regards,
Mikhail.

[0]: https://commitfest.postgresql.org/50/5160/
[1]: https://github.com/postgres/postgres/commit/8b2bcf3f287c79eaebf724cba57e5ff664b01e06