Thread: Bogus logic in RelationBuildPartitionDesc

Bogus logic in RelationBuildPartitionDesc

From
Tom Lane
Date:
I just had to retrieve my jaw from the floor after reading this
bit in RelationBuildPartitionDesc:

         * The system cache may be out of date; if so, we may find no pg_class
         * tuple or an old one where relpartbound is NULL.  In that case, try
         * the table directly.  We can't just AcceptInvalidationMessages() and
         * retry the system cache lookup because it's possible that a
         * concurrent ATTACH PARTITION operation has removed itself to the
         * ProcArray but yet added invalidation messages to the shared queue;
         * InvalidateSystemCaches() would work, but seems excessive.

As far as I can see, this argument is wrong on every count, and if it
were right, the code it is defending would still be wrong.

In the first place, it's claiming, based on no evidence, that our whole
approach to syscaches is wrong.  If this code needs to deal with obsolete
syscache entries then so do probably thousands of other places.

In the second place, the argument that AcceptInvalidationMessages wouldn't
work is BS, because the way that that *actually* works is that transaction
commit updates clog and sends inval messages before it releases locks.
So if you've acquired enough of a lock to be sure that the data you want
to read is stable, then you do not need to worry about whether you've
received any relevant inval messages.  You have --- and you don't even
need to call AcceptInvalidationMessages for yourself, because lock
acquisition already did, see e.g. LockRelationOid.

In the third place, if this imaginary risk that the syscache was out of
date were real, the code would be completely failing to deal with it,
because all it is testing is whether it found a null relpartbound value.
That wouldn't handle the case where a non-null relpartbound is obsolete,
which is what you'd expect after ATTACH PARTITION.

Furthermore, if all of the above can be rebutted, then what's the argument
that reading pg_class directly will produce a better answer?  The only way
that any of this could be useful is if you're trying to read data that is
changing under you because you didn't take an adequate lock.  In that case
there's no guarantee that what you will read from pg_class is up-to-date
either.

In reality, what this code is doing is examining relations that it found
by reading pg_inherit, using an MVCC snapshot, so I do not see what is the
argument for supposing that the pg_class cache is more out-of-date than
the pg_inherit data.

Unsurprisingly, the code coverage report shows that this code path is
never taken.  I think we could dike out partdesc.c lines 113-151
altogether, and make the code just above there look more like every
other syscache access in the backend.

If somebody's got some actual evidence that this is necessary, and not
a flight of feverish imagination, let's hear it.  (And maybe let's
develop an isolation test that exercises the code path, because there's
sure little reason to believe it works right now.)

            regards, tom lane



Re: Bogus logic in RelationBuildPartitionDesc

From
Robert Haas
Date:
On Sat, Dec 21, 2019 at 10:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I just had to retrieve my jaw from the floor after reading this
> bit in RelationBuildPartitionDesc:
>
>          * The system cache may be out of date; if so, we may find no pg_class
>          * tuple or an old one where relpartbound is NULL.  In that case, try
>          * the table directly.  We can't just AcceptInvalidationMessages() and
>          * retry the system cache lookup because it's possible that a
>          * concurrent ATTACH PARTITION operation has removed itself to the
>          * ProcArray but yet added invalidation messages to the shared queue;
>          * InvalidateSystemCaches() would work, but seems excessive.
>
> As far as I can see, this argument is wrong on every count, and if it
> were right, the code it is defending would still be wrong.
>
> In the first place, it's claiming, based on no evidence, that our whole
> approach to syscaches is wrong.  If this code needs to deal with obsolete
> syscache entries then so do probably thousands of other places.

No, because ATTACH PARTITION uses only ShareUpdateExclusiveLock,
unlike most other DDL.

> In the second place, the argument that AcceptInvalidationMessages wouldn't
> work is BS, because the way that that *actually* works is that transaction
> commit updates clog and sends inval messages before it releases locks.
> So if you've acquired enough of a lock to be sure that the data you want
> to read is stable, then you do not need to worry about whether you've
> received any relevant inval messages.  You have --- and you don't even
> need to call AcceptInvalidationMessages for yourself, because lock
> acquisition already did, see e.g. LockRelationOid.

No, because ATTACH PARTITION uses only ShareUpdateExclusiveLock, which
does not conflict with the AccessShareLock required to build a cache
entry.

> In the third place, if this imaginary risk that the syscache was out of
> date were real, the code would be completely failing to deal with it,
> because all it is testing is whether it found a null relpartbound value.
> That wouldn't handle the case where a non-null relpartbound is obsolete,
> which is what you'd expect after ATTACH PARTITION.

No, that's what you'd expect after DETACH PARTITION, but that takes
AccessExclusiveLock, so the problem doesn't occur.

> Furthermore, if all of the above can be rebutted, then what's the argument
> that reading pg_class directly will produce a better answer?  The only way
> that any of this could be useful is if you're trying to read data that is
> changing under you because you didn't take an adequate lock.  In that case
> there's no guarantee that what you will read from pg_class is up-to-date
> either.

The argument is that the only possible change is the concurrent
addition of a partition, and therefore the only thing that happen is
to go from NULL to non-NULL. It can't go from non-NULL to NULL, nor
from one non-NULL value to another.

> Unsurprisingly, the code coverage report shows that this code path is
> never taken.  I think we could dike out partdesc.c lines 113-151
> altogether, and make the code just above there look more like every
> other syscache access in the backend.

It turns out that I tested this, and that if you do that, it's
possible to produce failures. It's very hard to do so in the context
of a regression test because they are low-probability, but they do
happen. I believe some of the testing details are in the original
thread that's probably linked from the commit message that added those
lines.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company