Thread: race condition in pg_class

race condition in pg_class

From
Smolkin Grigory
Date:
Hello, hackers!
We are running PG13.10 and recently we have encountered what appears to be a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.
The problem is that after successfully creating an index on a relation (which previously didn't have any indexes), its pg_class.relhasindex remains set to "false", which is illegal, I think.
Index was built using the following statement:
ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
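
A query along these lines can flag any relation left in this inconsistent
state (relhasindex = false while pg_index still has entries for it):

select c.oid::regclass
from pg_class c
where not c.relhasindex
  and exists (select 1 from pg_index i where i.indrelid = c.oid);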

PG_CLASS:
# select ctid,oid,xmin,xmax, * from pg_class where oid = 3566558198;
-[ RECORD 1 ]-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ctid                | (12,49)
oid                 | 3566558198
xmin                | 1202298791
xmax                | 0
relname             | example
relnamespace        | 16479
reltype             | 3566558200
reloftype           | 0
relowner            | 16386
relam               | 2
relfilenode         | 3566558198
reltablespace       | 0
relpages            | 152251
reltuples           | 1.1565544e+07
relallvisible       | 127837
reltoastrelid       | 3566558203
relhasindex         | f
relisshared         | f
relpersistence      | p
relkind             | r
relnatts            | 12
relchecks           | 0
relhasrules         | f
relhastriggers      | f
relhassubclass      | f
relrowsecurity      | f
relforcerowsecurity | f
relispopulated      | t
relreplident        | d
relispartition      | f
relrewrite          | 0
relfrozenxid        | 1201647807
relminmxid          | 1
relacl              |
reloptions          |
relpartbound        |

PG_INDEX:
# select ctid,xmin,xmax,indexrelid::regclass,indrelid::regclass, * from pg_index where indexrelid = 3569625749;
-[ RECORD 1 ]--+---------------------------------------------
ctid           | (3,30)
xmin           | 1202295045
xmax           | 0
indexrelid     | "example_pkey"
indrelid       | "example"
indexrelid     | 3569625749
indrelid       | 3566558198
indnatts       | 1
indnkeyatts    | 1
indisunique    | t
indisprimary   | t
indisexclusion | f
indimmediate   | t
indisclustered | f
indisvalid     | t
indcheckxmin   | f
indisready     | t
indislive      | t
indisreplident | f
indkey         | 1
indcollation   | 0
indclass       | 3124
indoption      | 0
indexprs       |
indpred        |

Looking into the WAL via waldump gave us the following picture (full waldump output is attached):

tx: 1202295045, lsn: AAB1/D38378D0, prev AAB1/D3837208, desc: FPI , blkref #0: rel 1663/16387/3569625749 blk 0 FPW
tx: 1202298790, lsn: AAB1/D3912EC0, prev AAB1/D3912E80, desc: NEW_CID rel 1663/16387/1259; tid 6/24; cmin: 0, cmax: 4294967295, combo: 4294967295
tx: 1202298790, lsn: AAB1/D3927580, prev AAB1/D3926988, desc: COMMIT 2023-10-04 22:41:23.863979 UTC
tx: 1202298791, lsn: AAB1/D393C230, prev AAB1/D393C1F0, desc: HOT_UPDATE off 24 xmax 1202298791 flags 0x20 ; new off 45 xmax 0, blkref #0: rel 1663/16387/1259 blk 6
tx: 1202298791, lsn: AAB1/D394ADA0, prev AAB1/D394AD60, desc: UPDATE off 45 xmax 1202298791 flags 0x00 ; new off 28 xmax 0, blkref #0: rel 1663/16387/1259 blk 5, blkref #1: rel 1663/16387/1259 blk 6
tx: 1202298791, lsn: AAB1/D3961088, prev AAB1/D3961048, desc: NEW_CID rel 1663/16387/1259; tid 12/49; cmin: 50, cmax: 4294967295, combo: 4294967295
tx: 1202295045, lsn: AAB1/D3962E78, prev AAB1/D3962E28, desc: INPLACE off 24, blkref #0: rel 1663/16387/1259 blk 6
tx: 1202295045, lsn: AAB1/D39632A0, prev AAB1/D3963250, desc: COMMIT 2023-10-04 22:41:23.878565 UTC
tx: 1202298791, lsn: AAB1/D3973420, prev AAB1/D39733D0, desc: COMMIT 2023-10-04 22:41:23.884951 UTC

1202295045 - create index statement
1202298790 and 1202298791 are some other concurrent operations; unfortunately I wasn't able to determine what they are

So it looks like 1202295045 updated tuple (6,24) in pg_class INPLACE, at a point when its xmax had already been set by 1202298791 and the new tuple version had already been created at (12,49).
So after 1202298791 was committed, that inplace update was effectively lost.
If we do an inclusive PITR with (recovery_target_xid = 1202295045), we can see the following picture (notice relhasindex and xmax):

# select ctid,oid, xmin,xmax,relhasindex,cmin,cmax from pg_class where oid = 3566558198;
-[ RECORD 1 ]-----------
ctid        | (6,24)
oid         | 3566558198
xmin        | 1202298790
xmax        | 1202298791
relhasindex | t
cmin        | 0
cmax        | 0

I've tried to reproduce this scenario with CREATE INDEX and various concurrent statements, but no luck.
Attached full waldump output for the relevant WAL segment.

Re: race condition in pg_class

From
"Andrey M. Borodin"
Date:

> On 25 Oct 2023, at 13:39, Smolkin Grigory <smallkeen@gmail.com> wrote:
>
> We are running PG13.10 and recently we have encountered what appears to be a bug due to some race condition between
> ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.

> I've tried to reproduce this scenario with CREATE INDEX and various concurrent statements, but no luck.

Maybe it would be possible to reproduce by modifying the tests for concurrent index creation. For example, add “ANALYZE”
here [0].
Keep in mind that for easier reproduction it would make sense to increase the transaction count radically.


Best regards, Andrey Borodin.


[0] https://github.com/postgres/postgres/blob/master/contrib/amcheck/t/002_cic.pl#L34




Re: race condition in pg_class

From
Tom Lane
Date:
Smolkin Grigory <smallkeen@gmail.com> writes:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

ALTER TABLE ADD CONSTRAINT would certainly have taken
AccessExclusiveLock on the "example" table, which should be sufficient
to prevent anything else from touching its pg_class row.  The only
mechanism I can think of that might bypass that is a manual UPDATE on
pg_class, which would just manipulate the row as a row without concern
for associated relation-level locks.  Any chance that somebody was
doing something like that?
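
For illustration, the sort of statement I mean would look like this; it
requires superuser and takes no lock on "example" itself, only on pg_class:

-- manipulates the catalog row directly; the relation-level lock on
-- "example" held by ALTER TABLE does not block this
UPDATE pg_class SET relhasindex = false WHERE oid = 'example'::regclass;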

            regards, tom lane



Re: race condition in pg_class

From
Smolkin Grigory
Date:
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row.  The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks.  Any chance that somebody was
> doing something like that?

No chance. Our infrastructure doesn't do that, and users simply don't have the privileges to mess with pg_catalog.

Wed, 25 Oct 2023 at 21:06, Tom Lane <tgl@sss.pgh.pa.us>:
Smolkin Grigory <smallkeen@gmail.com> writes:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

ALTER TABLE ADD CONSTRAINT would certainly have taken
AccessExclusiveLock on the "example" table, which should be sufficient
to prevent anything else from touching its pg_class row.  The only
mechanism I can think of that might bypass that is a manual UPDATE on
pg_class, which would just manipulate the row as a row without concern
for associated relation-level locks.  Any chance that somebody was
doing something like that?

                        regards, tom lane

Re: race condition in pg_class

From
Noah Misch
Date:
On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which
> previously didn't have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

This is going to be a problem with any operation that does a transactional
pg_class update without taking a lock that conflicts with ShareLock.  GRANT
doesn't lock the table at all, so I can reproduce this in v17 as follows:

== session 1
create table t (c int);
begin;
grant select on t to public;

== session 2
alter table t add primary key (c);

== back in session 1
commit;
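
After that final commit, the table should be in the same broken state as the
original report; a quick check (sketch):

-- expected: relhasindex = f, even though the primary key exists
select relhasindex from pg_class where oid = 't'::regclass;
-- expected: one row, for t_pkey
select indexrelid::regclass from pg_index where indrelid = 't'::regclass;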


We'll likely need to change how we maintain relhasindex or perhaps take a lock
in GRANT.

> Looking into the WAL via waldump gave us the following picture (full
> waldump output is attached):

> 1202295045 - create index statement
> 1202298790 and 1202298791 are some other concurrent operations;
> unfortunately I wasn't able to determine what they are

Can you explore that as follows?

- PITR to just before the COMMIT record.
- Save all rows of pg_class.
- PITR to just after the COMMIT record.
- Save all rows of pg_class.
- Diff the two sets of saved rows.

Which columns changed?  The evidence you've shown would be consistent with a
transaction doing GRANT or REVOKE on dozens of tables.  If the changed column
is something other than relacl, that would be great to know.
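
For the "save all rows" steps, a server-side dump like the following would
work (file names are just placeholders), followed by a plain diff of the
two files:

-- at the first PITR stop point:
copy (select * from pg_class order by oid) to '/tmp/pg_class_before.txt';
-- after the second PITR, just past the COMMIT record:
copy (select * from pg_class order by oid) to '/tmp/pg_class_after.txt';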

On the off-chance it's relevant, what extensions do you have (\dx in psql)?



Re: race condition in pg_class

From
Smolkin Grigory
Date:
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
>
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
>
> == session 2
> alter table t add primary key (c);
>
> == back in session 1
> commit;
>
>
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

Oh, that explains it. Thank you very much.

> Can you explore that as follows?
>
>- PITR to just before the COMMIT record.
>- Save all rows of pg_class.
>- PITR to just after the COMMIT record.
>- Save all rows of pg_class.
>- Diff the two sets of saved rows.

Sure, but it will take some time; it's a large db with lots of WAL segments to apply.

> extensions

      extname       | extversion
--------------------+------------
 plpgsql            | 1.0
 pg_stat_statements | 1.8
 pg_buffercache     | 1.3
 pgstattuple        | 1.5

Re: race condition in pg_class

From
Noah Misch
Date:
On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> > We are running PG13.10 and recently we have encountered what appears to be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating an index on a relation (which
> > previously didn't have any indexes), its pg_class.relhasindex remains set to
> > "false", which is illegal, I think.

It's damaging.  The table will behave like it has no indexes.  If something
adds an index later, old indexes will reappear, corrupt, having not received
updates during the relhasindex=false era.  ("pg_amcheck --heapallindexed" can
detect this.)
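
At the SQL level, that is roughly amcheck's heapallindexed check, e.g.
(assuming the extension is installed, using the index from the report):

create extension if not exists amcheck;
-- errors if any heap tuple lacks a matching index entry
select bt_index_check('example_pkey', heapallindexed => true);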

> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
> 
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock.  GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
> 
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
> 
> == session 2
> alter table t add primary key (c);
> 
> == back in session 1
> commit;
> 
> 
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.

The main choice is accepting more DDL blocking vs. accepting inefficient
relcache builds.  Options I'm seeing:

=== "more DDL blocking" option family

B1. Take ShareUpdateExclusiveLock in GRANT, REVOKE, and anything that makes
    transactional pg_class updates without holding some stronger lock.  New
    asserts could catch future commands failing to do this.

B2. Take some shorter-lived lock around pg_class tuple formation, such that
    GRANT blocks CREATE INDEX, but two CREATE INDEX don't block each other.
    Anything performing a transactional update of a pg_class row would acquire
    the lock in exclusive mode before fetching the old tuple and hold it till
    end of transaction.  relhasindex=true in-place updates would acquire it
    the same way, but they would release it after the inplace update.  I
    expect a new heavyweight lock type, e.g. LOCKTAG_RELATION_DEFINITION, with
    the same key as LOCKTAG_RELATION.  This has less blocking than the
    previous option, but it's more complicated to explain to both users and
    developers.

B3. Have CREATE INDEX do an EvalPlanQual()-style thing to update all successor
    tuple versions.  Like the previous option, this would require new locking,
    but the new lock would not need to persist till end of xact.  It would be
    even more complicated to explain to users and developers.  (If this is
    promising enough to warrant more detail, let me know.)

B4. Use transactional updates to set relhasindex=true.  Two CREATE INDEX
    commands on the same table would block each other.  If we did it the way
    most DDL does today, they'd get "tuple concurrently updated" failures
    after the blocking ends.

=== "inefficient relcache builds" option family

R1. Ignore relhasindex; possibly remove it in v17.  Relcache builds et
    al. will issue more superfluous queries.

R2. As a weird variant of the previous option, keep relhasindex and make all
    transactional updates of pg_class set relhasindex=true pessimistically.
    (VACUUM will set it back to false.)

=== other

O1. This is another case where the sometimes-discussed "pg_class_nt" for
    nontransactional columns would help.  I'm ruling that out as too hard to
    back-patch.


Are there other options important to consider?  I currently like (B1) the
most, followed closely by (R1) and (B2).  A key unknown is the prevalence of
index-free tables.  Low prevalence would argue in favor of (R1).  In my
limited experience, they've been rare.  That said, I assume relcache builds
happen a lot more than GRANTs, so it's harder to bound the damage from (R1)
compared to the damage from (B1).  Thoughts on this decision?
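
For anyone wanting a quick data point on that prevalence question from
their own databases, something like this gives the split of ordinary
tables with and without indexes:

select relhasindex, count(*)
from pg_class
where relkind = 'r'
group by relhasindex;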

Thanks,
nm



Re: race condition in pg_class

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
>> We'll likely need to change how we maintain relhasindex or perhaps take a lock
>> in GRANT.

> The main choice is accepting more DDL blocking vs. accepting inefficient
> relcache builds.  Options I'm seeing:

It looks to me like you're only thinking about relhasindex, but it
seems to me that any call of heap_inplace_update brings some
risk of this kind.  Excluding the bootstrap-mode-only usage in
create_toast_table, I see four callers:

* index_update_stats updating a pg_class tuple's
  relhasindex, relpages, reltuples, relallvisible

* vac_update_relstats updating a pg_class tuple's
  relpages, reltuples, relallvisible, relhasindex, relhasrules,
  relhastriggers, relfrozenxid, relminmxid

* vac_update_datfrozenxid updating a pg_database tuple's
  datfrozenxid, datminmxid

* dropdb updating a pg_database tuple's datconnlimit

So we have just as much of a problem with GRANTs on databases
as GRANTs on relations.  Also, it looks like we can lose
knowledge of the presence of rules and triggers, which seems
nearly as bad as forgetting about indexes.  The rest of these
updates might not be correctness-critical, although I wonder
how bollixed things could get if we forget an advancement of
relfrozenxid or datfrozenxid (especially if the calling
transaction goes on to make other changes that assume that
the update happened).

BTW, vac_update_datfrozenxid believes (correctly I think) that
it cannot use the syscache copy of a tuple as the basis for in-place
update, because syscache will have detoasted any toastable fields.
These other callers are ignoring that, which seems like it should
result in heap_inplace_update failing with "wrong tuple length".
I wonder how come we're not seeing reports of that from the field.

I'm inclined to propose that heap_inplace_update should check to
make sure that it's operating on the latest version of the tuple
(including, I guess, waiting for an uncommitted update?) and throw
error if not.  I think this is what your B3 option is, but maybe
I misinterpreted.  It might be better to throw error immediately
instead of waiting to see if the other updater commits.

            regards, tom lane



Re: race condition in pg_class

From
Noah Misch
Date:
On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> >> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> >> in GRANT.
> 
> > The main choice is accepting more DDL blocking vs. accepting inefficient
> > relcache builds.  Options I'm seeing:
> 
> It looks to me like you're only thinking about relhasindex, but it
> seems to me that any call of heap_inplace_update brings some
> risk of this kind.  Excluding the bootstrap-mode-only usage in
> create_toast_table, I see four callers:
> 
> * index_update_stats updating a pg_class tuple's
>   relhasindex, relpages, reltuples, relallvisible
> 
> * vac_update_relstats updating a pg_class tuple's
>   relpages, reltuples, relallvisible, relhasindex, relhasrules,
>   relhastriggers, relfrozenxid, relminmxid
> 
> * vac_update_datfrozenxid updating a pg_database tuple's
>   datfrozenxid, datminmxid
> 
> * dropdb updating a pg_database tuple's datconnlimit
> 
> So we have just as much of a problem with GRANTs on databases
> as GRANTs on relations.  Also, it looks like we can lose
> knowledge of the presence of rules and triggers, which seems
> nearly as bad as forgetting about indexes.  The rest of these
> updates might not be correctness-critical, although I wonder
> how bollixed things could get if we forget an advancement of
> relfrozenxid or datfrozenxid (especially if the calling
> transaction goes on to make other changes that assume that
> the update happened).

Thanks for researching that.  Let's treat frozenxid stuff as critical; I
wouldn't want to advance XID limits based on a datfrozenxid that later gets
rolled back.  I agree relhasrules and relhastriggers are also critical.  The
"inefficient relcache builds" option family can't solve cases like
relfrozenxid and datconnlimit, so that leaves us with the "more DDL blocking"
option family.

> BTW, vac_update_datfrozenxid believes (correctly I think) that
> it cannot use the syscache copy of a tuple as the basis for in-place
> update, because syscache will have detoasted any toastable fields.
> These other callers are ignoring that, which seems like it should
> result in heap_inplace_update failing with "wrong tuple length".
> I wonder how come we're not seeing reports of that from the field.

Good question.  Perhaps we'll need some test cases that exercise each inplace
update against a row having a toast pointer.  It's too easy to go a long time
without encountering those in the field.
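
For the pg_database case, one way to construct such a row might be to
bloat datacl until it is stored out of line, then let a database-wide
VACUUM reach vac_update_datfrozenxid().  A rough sketch (role names
invented; assumes the current database is named "postgres"):

do $$
begin
  for i in 1..3000 loop
    execute format('create role acl_bloat_%s', i);
    execute format('grant connect on database postgres to acl_bloat_%s', i);
  end loop;
end $$;
-- the database-wide pass ends with an inplace update of that pg_database row
vacuum freeze;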

> I'm inclined to propose that heap_inplace_update should check to
> make sure that it's operating on the latest version of the tuple
> (including, I guess, waiting for an uncommitted update?) and throw
> error if not.  I think this is what your B3 option is, but maybe
> I misinterpreted.  It might be better to throw error immediately
> instead of waiting to see if the other updater commits.

That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
waiting for GRANT to commit but instead inplace-updating every member of the
update chain.  For B2, I was thinking we don't need to error.  There are two
problematic orders of events.  The easy one is heap_inplace_update() mutating
a tuple that already has an xmax.  That's the one in the test case upthread,
and detecting it is trivial.  The harder one is heap_inplace_update() mutating
a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
I anticipate a new locktag per catalog that can receive inplace updates,
i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.  Here's a
walk-through for the pg_database case.  GRANT will use the following sequence
of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- fetch latest pg_database tuple
- heap_update()
- COMMIT, releasing LOCKTAG_DATABASE_DEFINITION

vac_update_datfrozenxid() sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- (now, all GRANTs on the given database have committed or aborted)
- fetch latest pg_database tuple
- heap_inplace_update()
- release LOCKTAG_DATABASE_DEFINITION, even if xact not ending
- continue with other steps, e.g. vac_truncate_clog()

How does that compare to what you envisioned?  vac_update_datfrozenxid() could
further use xmax as a best-efforts thing to catch conflict with manual UPDATE
statements, but it wouldn't solve the case where the UPDATE had fetched the
tuple but not yet heap_update()'d it.



Re: race condition in pg_class

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
>> I'm inclined to propose that heap_inplace_update should check to
>> make sure that it's operating on the latest version of the tuple
>> (including, I guess, waiting for an uncommitted update?) and throw
>> error if not.  I think this is what your B3 option is, but maybe
>> I misinterpreted.  It might be better to throw error immediately
>> instead of waiting to see if the other updater commits.

> That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> waiting for GRANT to commit but instead inplace-updating every member of the
> update chain.  For B2, I was thinking we don't need to error.  There are two
> problematic orders of events.  The easy one is heap_inplace_update() mutating
> a tuple that already has an xmax.  That's the one in the test case upthread,
> and detecting it is trivial.  The harder one is heap_inplace_update() mutating
> a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().

Ugh ... you're right, what I was imagining would not catch that last case.

> I anticipate a new locktag per catalog that can receive inplace updates,
> i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.

We could perhaps make this work by using the existing tuple-lock
infrastructure, rather than inventing new locktags (a choice that
spills to a lot of places including clients that examine pg_locks).

I would prefer though to find a solution that only depends on making
heap_inplace_update protect itself, without high-level cooperation
from the possibly-interfering updater.  This is basically because
I'm still afraid that we're defining the problem too narrowly.
For one thing, I have nearly zero confidence that GRANT et al are
the only problematic source of conflicting transactional updates.
For another, I'm worried that some extension may be using
heap_inplace_update against a catalog we're not considering here.
I'd also like to find a solution that fixes the case of a conflicting
manual UPDATE (although certainly that's a stretch goal we may not be
able to reach).

I wonder if there's a way for heap_inplace_update to mark the tuple
header as just-updated in a way that regular heap_update could
recognize.  (For standard catalog updates, we'd then end up erroring
in simple_heap_update, which I think is fine.)  We can't update xmin,
because the VACUUM callers don't have an XID; but maybe there's some
other way?  I'm speculating about putting a funny value into xmax,
or something like that, and having heap_update check that what it
sees in xmax matches what was in the tuple the update started with.

Or we could try to get rid of in-place updates, but that seems like
a mighty big lift.  All of the existing callers, except maybe
the johnny-come-lately dropdb usage, have solid documented reasons
to do it that way.

            regards, tom lane



Re: race condition in pg_class

From
Noah Misch
Date:
On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> >> I'm inclined to propose that heap_inplace_update should check to
> >> make sure that it's operating on the latest version of the tuple
> >> (including, I guess, waiting for an uncommitted update?) and throw
> >> error if not.  I think this is what your B3 option is, but maybe
> >> I misinterpreted.  It might be better to throw error immediately
> >> instead of waiting to see if the other updater commits.
> 
> > That's perhaps closer to B2.  To be pedantic, B3 was about not failing or
> > waiting for GRANT to commit but instead inplace-updating every member of the
> > update chain.  For B2, I was thinking we don't need to error.  There are two
> > problematic orders of events.  The easy one is heap_inplace_update() mutating
> > a tuple that already has an xmax.  That's the one in the test case upthread,
> > and detecting it is trivial.  The harder one is heap_inplace_update() mutating
> > a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
> 
> Ugh ... you're right, what I was imagining would not catch that last case.
> 
> > I anticipate a new locktag per catalog that can receive inplace updates,
> > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> 
> We could perhaps make this work by using the existing tuple-lock
> infrastructure, rather than inventing new locktags (a choice that
> spills to a lot of places including clients that examine pg_locks).

That could be okay.  It would be weird to reuse a short-term lock like that
one as something held till end of transaction.  But the alternative of new
locktags ain't perfect, as you say.

> I would prefer though to find a solution that only depends on making
> heap_inplace_update protect itself, without high-level cooperation
> from the possibly-interfering updater.  This is basically because
> I'm still afraid that we're defining the problem too narrowly.
> For one thing, I have nearly zero confidence that GRANT et al are
> the only problematic source of conflicting transactional updates.

Likewise here, but I have fair confidence that an assertion would flush out
the rest.  heap_inplace_update() would assert that the backend holds one of
the acceptable locks.  It could even be an elog; heap_inplace_update() can
tolerate that cost.

> For another, I'm worried that some extension may be using
> heap_inplace_update against a catalog we're not considering here.

A pgxn search finds "citus" using heap_inplace_update().

> I'd also like to find a solution that fixes the case of a conflicting
> manual UPDATE (although certainly that's a stretch goal we may not be
> able to reach).

It would be nice.

> I wonder if there's a way for heap_inplace_update to mark the tuple
> header as just-updated in a way that regular heap_update could
> recognize.  (For standard catalog updates, we'd then end up erroring
> in simple_heap_update, which I think is fine.)  We can't update xmin,
> because the VACUUM callers don't have an XID; but maybe there's some
> other way?  I'm speculating about putting a funny value into xmax,
> or something like that, and having heap_update check that what it
> sees in xmax matches what was in the tuple the update started with.

Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
heap_update() could complain if the field changed vs. last seen value.  This
feels like something to regret later in terms of limiting our ability to
harness those fields for more-valuable ends or compact them away in a future
page format.  I can't pinpoint a specific loss, so the idea might have legs.
Nontransactional data in separate tables or in new metapages smells like the
right long-term state.  A project wanting to reuse the tuple header bits could
introduce such storage to unblock its own bit reuse.

> Or we could try to get rid of in-place updates, but that seems like
> a mighty big lift.  All of the existing callers, except maybe
> the johnny-come-lately dropdb usage, have solid documented reasons
> to do it that way.

Yes, removing that smells problematic.



Re: race condition in pg_class

From
Noah Misch
Date:
I prototyped two ways, one with a special t_ctid and one with LockTuple().

On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:

> > > I anticipate a new locktag per catalog that can receive inplace updates,
> > > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> > 
> > We could perhaps make this work by using the existing tuple-lock
> > infrastructure, rather than inventing new locktags (a choice that
> > spills to a lot of places including clients that examine pg_locks).
> 
> That could be okay.  It would be weird to reuse a short-term lock like that
> one as something held till end of transaction.  But the alternative of new
> locktags ain't perfect, as you say.

That worked.

> > I would prefer though to find a solution that only depends on making
> > heap_inplace_update protect itself, without high-level cooperation
> > from the possibly-interfering updater.  This is basically because
> > I'm still afraid that we're defining the problem too narrowly.
> > For one thing, I have nearly zero confidence that GRANT et al are
> > the only problematic source of conflicting transactional updates.
> 
> Likewise here, but I have fair confidence that an assertion would flush out
> the rest.  heap_inplace_update() would assert that the backend holds one of
> the acceptable locks.  It could even be an elog; heap_inplace_update() can
> tolerate that cost.

That check would fall in both heap_inplace_update() and heap_update().  After
all, a heap_inplace_update() check won't detect an omission in GRANT.

> > For another, I'm worried that some extension may be using
> > heap_inplace_update against a catalog we're not considering here.
> 
> A pgxn search finds "citus" using heap_inplace_update().
> 
> > I'd also like to find a solution that fixes the case of a conflicting
> > manual UPDATE (although certainly that's a stretch goal we may not be
> > able to reach).
> 
> It would be nice.

I expect most approaches could get there by having ExecModifyTable() arrange
for the expected locking or other actions.  That's analogous to how
heap_update() takes care of sinval even for a manual UPDATE.

> > I wonder if there's a way for heap_inplace_update to mark the tuple
> > header as just-updated in a way that regular heap_update could
> > recognize.  (For standard catalog updates, we'd then end up erroring
> > in simple_heap_update, which I think is fine.)  We can't update xmin,
> > because the VACUUM callers don't have an XID; but maybe there's some
> > other way?  I'm speculating about putting a funny value into xmax,
> > or something like that, and having heap_update check that what it
> > sees in xmax matches what was in the tuple the update started with.
> 
> Hmmm.  Achieving it without an XID would be the real trick.  (With an XID, we
> could use xl_heap_lock like heap_update() does.)  Thinking out loud, what if
> heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
> TransactionIdAdvance(xmax)?  Or change t_ctid in a similar way.  Then regular
> heap_update() could complain if the field changed vs. last seen value.  This
> feels like something to regret later in terms of limiting our ability to
> harness those fields for more-valuable ends or compact them away in a future
> page format.  I can't pinpoint a specific loss, so the idea might have legs.
> Nontransactional data in separate tables or in new metapages smells like the
> right long-term state.  A project wanting to reuse the tuple header bits could
> introduce such storage to unblock its own bit reuse.

heap_update() does not have the pre-modification xmax today, so I used t_ctid.
heap_modify_tuple() preserves t_ctid, so heap_update() already has the
pre-modification t_ctid in key cases.  For details of how the prototype uses
t_ctid, see comment at "#define InplaceCanaryOffsetNumber".  The prototype
doesn't prevent corruption in the following scenario, because the aborted
ALTER TABLE RENAME overwrites the special t_ctid:

  == session 1
  drop table t;
  create table t (c int);
  begin;
  -- in gdb, set breakpoint on heap_modify_tuple
  grant select on t to public;

  == session 2
  alter table t add primary key (c);
  begin; alter table t rename to t2; rollback;

  == back in session 1
  -- release breakpoint
  -- want error (would get it w/o the begin;alter;rollback)
  commit;

I'm missing how to mark the tuple in a fashion accessible to a second
heap_update() after a rolled-back heap_update().  The mark needs enough bits
"N" so it's implausible for 2^N inplace updates to happen between GRANT
fetching the old tuple and GRANT completing heap_update().  Looking for bits
that could persist across a rolled-back heap_update(), we have 3 in t_ctid, 2
in t_infomask2, and 0 in xmax.  I definitely don't want to paint us into a
corner by spending the t_infomask2 bits on this.  Even if I did, 2^(3+2)=32
wouldn't clearly be enough inplace updates.

Is there a way to salvage the goal of fixing the bug without modifying code
like ExecGrant_common()?  If not, I'm inclined to pick from one of the
following designs:

- Acquire ShareUpdateExclusiveLock in GRANT ((B1) from previous list).  It
  does make GRANT more intrusive; e.g. GRANT will cancel autovacuum.  I'm
  leaning toward this one for two reasons.  First, it doesn't slow
  heap_update() outside of assert builds.  Second, it makes the GRANT
  experience more like the rest our DDL, in that concurrent DDL will make
  GRANT block, not fail.

- GRANT passes to heapam the fixed-size portion of the pre-modification tuple.
  heap_update() compares those bytes to the oldtup in shared buffers to see if
  an inplace update happened.  (HEAD could get the bytes from a new
  heap_update() parameter, while back branches would need a different passing
  approach.)

- LockTuple(), as seen in its attached prototype.  I like this least at the
  moment, because it changes pg_locks content without having a clear advantage
  over the previous option.  Also, the prototype has enough FIXME markers that
  I expect this to get hairy before it's done.

I might change my preference after further prototypes.  Does anyone have a
strong preference between those?  Briefly, I did consider these additional
alternatives:

- Just accept the yet-rarer chance of corruption from this message's test
  procedure.

- Hold a buffer lock long enough to solve things.

- Remember the tuples where we overwrote a special t_ctid, and reverse the
  overwrite during abort processing.  But I/O in the abort path sounds bad.

Thanks,
nm


Re: race condition in pg_class

From
Noah Misch
Date:
I'm attaching patches implementing the LockTuple() design.  It turns out we
don't just lose inplace updates.  We also overwrite unrelated tuples,
reproduced at inplace.spec.  Good starting points are README.tuplock and the
heap_inplace_update_scan() header comment.

On Wed, Nov 01, 2023 at 08:09:15PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> > On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > > We could perhaps make this work by using the existing tuple-lock
> > > infrastructure, rather than inventing new locktags (a choice that
> > > spills to a lot of places including clients that examine pg_locks).

> > > I'd also like to find a solution that fixes the case of a conflicting
> > > manual UPDATE (although certainly that's a stretch goal we may not be
> > > able to reach).

I implemented that; search for ri_needLockTagTuple.

> - GRANT passes to heapam the fixed-size portion of the pre-modification tuple.
>   heap_update() compares those bytes to the oldtup in shared buffers to see if
>   an inplace update happened.  (HEAD could get the bytes from a new
>   heap_update() parameter, while back branches would need a different passing
>   approach.)

This could have been fine, but ...

> - LockTuple(), as seen in its attached prototype.  I like this least at the
>   moment, because it changes pg_locks content without having a clear advantage
>   over the previous option.

... I settled on the LockTuple() design for these reasons:

- Solves more conflicts by waiting, instead of by ERROR or by retry loops.
- Extensions wanting inplace updates don't have a big disadvantage over core
  code inplace updates.
- One could use this to stop "tuple concurrently updated" for pg_class rows,
  by using SearchSysCacheLocked1() for all pg_class DDL and making that
  function wait for any existing xmax like inplace_xmax_lock() does.  I don't
  expect to write that, but it's a nice option to have.
- pg_locks shows the new lock acquisitions.
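
For example, while an inplace updater waits behind an uncommitted GRANT, a
query like this should show the tuple-level locks involved on the pg_class
row, including the one not yet granted:

select locktype, relation::regclass, page, tuple, mode, granted
from pg_locks
where locktype = 'tuple' and relation = 'pg_class'::regclass;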

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
  CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
  still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.

- AtEOXact_Inval(true) is outside the RecordTransactionCommit() critical
  section, but it is critical.  We must not commit transactional DDL without
  other backends receiving an inval.  (When the inplace inval becomes
  nontransactional, it will face the same threat.)

- Trouble is possible, I bet, if the system crashes between the inplace-update
  memcpy() and XLogInsert().  See the new XXX comment below the memcpy().
  Might solve this by inplace update setting DELAY_CHKPT, writing WAL, and
  finally issuing memcpy() into the buffer.

- [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
  xmin does not stop pruning, an MVCC scan in that backend can find zero
  tuples when one is live.  This is like what all backends got in the days of
  SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
  the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
  setting that flag later and unsetting it earlier.)

If you find decisions in this thread's patches are tied to any of those such
that I should not separate those, let's discuss that.  Topics in the patches
that I feel are most fruitful for debate:

- This makes inplace update block if the tuple has an updater.  It's like one
  GRANT blocking another, except an inplace updater won't get "ERROR:  tuple
  concurrently updated" like one of the GRANTs would (see the sketch after
  this list).  I had implemented
  versions that avoided this blocking by mutating each tuple in the updated
  tuple chain.  That worked, but it had corner cases bad for maintainability,
  listed in the inplace_xmax_lock() header comment.  I'd rather accept the
  blocking, so hackers can rule out those corner cases.  A long-running GRANT
  already hurts VACUUM progress more just by keeping an XID running.

- Pre-checks could make heap_inplace_update_cancel() calls rarer.  Avoiding
  one of those avoids an exclusive buffer lock, and it avoids waiting on
  concurrent heap_update() if any.  We'd pre-check the syscache tuple.
  EventTriggerOnLogin() does it that way, because the code was already in that
  form.  I expect only vac_update_datfrozenxid() concludes !dirty enough to
  matter.  I didn't bother with the optimization, but it would be simple.

- If any citus extension user feels like porting its heap_inplace_update()
  call to this, I'd value hearing about your experience.

- I paid more than my usual attention to test coverage, considering the patch
  stack's intensity compared to most back-patch bug fixes.
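
To make the GRANT-vs-GRANT comparison in the first point concrete, this is
the existing behavior being referred to (scratch table name invented):

== session 1
create table t2 (c int);
begin;
grant select on t2 to public;

== session 2
-- blocks behind session 1's uncommitted pg_class update
grant insert on t2 to public;

== back in session 1
-- once this commits, session 2 fails with
-- ERROR:  tuple concurrently updated
commit;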

I've kept all the above topics brief; feel free to ask for more details.

Thanks,
nm


Re: race condition in pg_class

From
Michael Paquier
Date:
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.  It turns out we
> don't just lose inplace updates.  We also overwrite unrelated tuples,
> reproduced at inplace.spec.  Good starting points are README.tuplock and the
> heap_inplace_update_scan() header comment.

About inplace050-tests-inj-v1.patch.

+    /* Check if blocked_pid is in injection_wait(). */
+    proc = BackendPidGetProc(blocked_pid);
+    if (proc == NULL)
+        PG_RETURN_BOOL(false);    /* session gone: definitely unblocked */
+    wait_event =
+        pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
+    if (wait_event && strncmp("INJECTION_POINT(",
+                              wait_event,
+                              strlen("INJECTION_POINT(")) == 0)
+        PG_RETURN_BOOL(true);

Hmm.  I am not sure that this is the right interface for the job
because this is not only related to injection points but to the
monitoring of one or more wait events when running a permutation
step.  Perhaps this is something that should be linked to the spec
files with some property area listing the wait events we're expected
to wait on instead when running a step that we know will wait?
--
Michael


Re: race condition in pg_class

From
Noah Misch
Date:
On Mon, May 13, 2024 at 04:59:59PM +0900, Michael Paquier wrote:
> About inplace050-tests-inj-v1.patch.
> 
> +    /* Check if blocked_pid is in injection_wait(). */
> +    proc = BackendPidGetProc(blocked_pid);
> +    if (proc == NULL)
> +        PG_RETURN_BOOL(false);    /* session gone: definitely unblocked */
> +    wait_event =
> +        pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
> +    if (wait_event && strncmp("INJECTION_POINT(",
> +                              wait_event,
> +                              strlen("INJECTION_POINT(")) == 0)
> +        PG_RETURN_BOOL(true);
> 
> Hmm.  I am not sure that this is the right interface for the job
> because this is not only related to injection points but to the
> monitoring of one or more wait events when running a permutation
> step.

Could you say more about that?  Permutation steps don't monitor wait events
today.  This patch would be the first instance of that.

> Perhaps this is something that should be linked to the spec
> files with some property area listing the wait events we're expected
> to wait on instead when running a step that we know will wait?

The spec syntax doesn't distinguish contention types at all.  The isolation
tester's needs are limited to distinguishing:

  (a) process is waiting on another test session
  (b) process is waiting on automatic background activity (autovacuum, mainly)

Automatic background activity doesn't make a process enter or leave
injection_wait(), so all injection point wait events fall in (a).  (The tester
ignores (b), since those clear up without intervention.  Failing to ignore
them, as the tester did long ago, made output unstable.)



Re: race condition in pg_class

From
Robert Haas
Date:
On Sun, May 12, 2024 at 7:29 PM Noah Misch <noah@leadboat.com> wrote:
> - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
>   xmin does not stop pruning, an MVCC scan in that backend can find zero
>   tuples when one is live.  This is like what all backends got in the days of
>   SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
>   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
>   setting that flag later and unsetting it earlier.)

Are you saying that this is a problem already, or that the patch
causes it to start happening? If it's the former, that's horrible. If
it's the latter, I'd say that is a fatal defect.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: race condition in pg_class

From
Noah Misch
Date:
On Mon, May 13, 2024 at 03:53:08PM -0400, Robert Haas wrote:
> On Sun, May 12, 2024 at 7:29 PM Noah Misch <noah@leadboat.com> wrote:
> > - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
> >   xmin does not stop pruning, an MVCC scan in that backend can find zero
> >   tuples when one is live.  This is like what all backends got in the days of
> >   SnapshotNow catalog scans.  See the pgbench test suite addition.  (Perhaps
> >   the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
> >   setting that flag later and unsetting it earlier.)
> 
> Are you saying that this is a problem already, or that the patch
> causes it to start happening? If it's the former, that's horrible.

The former.



Re: race condition in pg_class

From
Noah Misch
Date:
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.

Starting 2024-06-10, I plan to push the first seven of the ten patches:

inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
inplace010-tests-v1.patch
inplace040-waitfuncs-v1.patch
inplace050-tests-inj-v1.patch
inplace060-nodeModifyTable-comments-v1.patch
  Those five just deal in tests, test infrastructure, and comments.
inplace070-rel-locks-missing-v1.patch
  Main risk is new DDL deadlocks.
inplace080-catcache-detoast-inplace-stale-v1.patch
  If it fails to fix the bug it targets, I expect it's a no-op rather than
  breaking things.

I'll leave the last three of the ten needing review.  Those three are beyond
my skill to self-certify.



Re: race condition in pg_class

From
Robert Haas
Date:
On Wed, Jun 5, 2024 at 2:17 PM Noah Misch <noah@leadboat.com> wrote:
> Starting 2024-06-10, I plan to push the first seven of the ten patches:
>
> inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
> inplace010-tests-v1.patch
> inplace040-waitfuncs-v1.patch
> inplace050-tests-inj-v1.patch
> inplace060-nodeModifyTable-comments-v1.patch
>   Those five just deal in tests, test infrastructure, and comments.
> inplace070-rel-locks-missing-v1.patch
>   Main risk is new DDL deadlocks.
> inplace080-catcache-detoast-inplace-stale-v1.patch
>   If it fails to fix the bug it targets, I expect it's a no-op rather than
>   breaking things.
>
> I'll leave the last three of the ten needing review.  Those three are beyond
> my skill to self-certify.

It's not this patch set's fault, but I'm not very pleased to see that
the injection point wait events have been shoehorned into the
"Extension" category - which they are not - instead of being a new
wait_event_type. That would have avoided the ugly wait-event naming
pattern, inconsistent with everything else, introduced by
inplace050-tests-inj-v1.patch.

I think that the comments and commit messages in this patch set could,
in some places, use improvement. For instance,
inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of
comments, which makes it hard to see what actually changed, and the
commit message doesn't tell you, either. A good bit of it seems to be
changing "a view" to "a view INSTEAD OF trigger" or "a view having an
INSTEAD OF trigger," but the reasoning behind that change is not
spelled out anywhere. The reader is left to guess what the other case
is and why the same principles don't apply to it. I don't doubt that
the new comments are more correct than the old ones, but I expect
future patch authors to have difficulty maintaining that state of
affairs.

Similarly, inplace070-rel-locks-missing-v1.patch adds no comments.
IMHO, the commit message also isn't very informative. It disclaims
knowledge of what bug it's fixing, while at the same time leaving the
reader to figure out for themselves how the behavior has changed.
Consequently, I expect writing the release notes for a release
including this patch to be difficult: "We added some locks that block
... something ... in some circumstances ... to prevent ... something."
It's not really the job of the release note author to fill in those
blanks, but rather of the patch author or committer. I don't want to
overburden the act of fixing bugs, but I just feel like more
explanation is needed here. When I see for example that we're adding a
lock acquisition to the end of heap_create(), I can't help but wonder
if it's really true that we don't take a lock on a just-created
relation today. I'm certainly under the impression that we lock
newly-created, uncommitted relations, and a quick test seems to
confirm that. I don't quite know whether that happens, but evidently
this call is guarding against something more subtle than a categorical
failure to lock a relation on creation so I think there should be a
comment explaining what that thing is.

It's also quite surprising that SetRelationHasSubclass() says "take X
lock before calling" and 2 of 4 callers just don't. I guess that's how
it is. But shouldn't we then have an assertion inside that function to
guard against future mistakes? If the reason why we failed to add this
initially is discernible from the commit messages that introduced the
bug, it would be nice to mention what it seems to have been; if not,
it would at least be nice to mention the offending commit(s). I'm also
a bit worried that this is going to cause deadlocks, but I suppose if
it does, that's still better than the status quo.

IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation
instead of IsInplaceUpdateOid.

inplace080-catcache-detoast-inplace-stale-v1.patch seems like another
place where spelling out the rationale in more detail would be helpful
to future readers; for instance, the commit message says that
PgDatabaseToastTable is the only one affected, but it doesn't say why
the others are not, or why this one is. The lengthy comment in
CatalogCacheCreateEntry is also difficult to correlate with the code
which follows. I can't guess whether the two cases called out in the
comment always needed to be handled and were handled save only for
in-place updates, and thus the comment changes were simply taking the
opportunity to elaborate on the existing comments; or whether one of
those cases is preexisting and the other arises from the desire to
handle inplace updates. It could be helpful to mention relevant
identifiers from the code in the comment text e.g.
"systable_recheck_tuple detects ordinary updates by noting changes to
the tuple's visibility information, while the equalTuple() case
detects inplace updates."

IMHO, this patch set underscores the desirability of removing in-place
update altogether. That sounds difficult and not back-patchable, but I
can't classify what this patch set does as anything better than grotty
hacks to work around serious design deficiencies. That is not a vote
against these patches: I see no better way forward. Nonetheless, I
dislike the lack of better options.

I have done only cursory review of the last two patches and don't feel
I'm in a place to certify them, at least not now.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: race condition in pg_class

From
Michael Paquier
Date:
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> It's not this patch set's fault, but I'm not very pleased to see that
> the injection point wait events have been shoehorned into the
> "Extension" category - which they are not - instead of being a new
> wait_event_type. That would have avoided the ugly wait-event naming
> pattern, inconsistent with everything else, introduced by
> inplace050-tests-inj-v1.patch.

I'm not sure I agree with that.  The set of core backend APIs supporting
injection points have nothing to do with wait events.  The library
attached to one or more injection points *may* decide to use a wait
event like what the wait/wakeup calls in modules/injection_points do,
but that's entirely optional.  These rely on custom wait events,
plugged into the Extension category as the code run is itself in an
extension.  I am not arguing against the point that it may be
interesting to plug in custom wait event categories, but the current
design of wait events makes that much harder than what core is
currently able to handle, and I am not sure that this brings much at
the end as long as the wait event strings can be customized.

I've voiced upthread concerns over the naming enforced by the patch
and the way it plugs the namings into the isolation functions, by the
way.
--
Michael


Re: race condition in pg_class

From
Robert Haas
Date:
On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > It's not this patch set's fault, but I'm not very pleased to see that
> > the injection point wait events have been shoehorned into the
> > "Extension" category - which they are not - instead of being a new
> > wait_event_type. That would have avoided the ugly wait-event naming
> > pattern, inconsistent with everything else, introduced by
> > inplace050-tests-inj-v1.patch.
>
> I'm not sure I agree with that.  The set of core backend APIs supporting
> injection points have nothing to do with wait events.  The library
> attached to one or more injection points *may* decide to use a wait
> event like what the wait/wakeup calls in modules/injection_points do,
> but that's entirely optional.  These rely on custom wait events,
> plugged into the Extension category as the code run is itself in an
> extension.  I am not arguing against the point that it may be
> interesting to plug in custom wait event categories, but the current
> design of wait events makes that much harder than what core is
> currently able to handle, and I am not sure that this brings much at
> the end as long as the wait event strings can be customized.
>
> I've voiced upthread concerns over the naming enforced by the patch
> and the way it plugs the namings into the isolation functions, by the
> way.

I think the core code should provide an "Injection Point" wait event
type and let extensions add specific wait events there, just like you
did for "Extension". Then this ugly naming would go away. As I see it,
"Extension" is only supposed to be used as a catch-all when we have no
other information, but here we do. If we refuse to use the
wait_event_type field to categorize waits, then people are going to
have to find some other way to get that data into the system, as Noah
has done.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: race condition in pg_class

From
Noah Misch
Date:
On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > > It's not this patch set's fault, but I'm not very pleased to see that
> > > the injection point wait events have been shoehorned into the
> > > "Extension" category - which they are not - instead of being a new
> > > wait_event_type. That would have avoided the ugly wait-event naming
> > > pattern, inconsistent with everything else, introduced by
> > > inplace050-tests-inj-v1.patch.
> >
> > I'm not sure I agree with that.  The set of core backend APIs supporting
> > injection points have nothing to do with wait events.  The library
> > attached to one or more injection points *may* decide to use a wait
> > event like what the wait/wakeup calls in modules/injection_points do,
> > but that's entirely optional.  These rely on custom wait events,
> > plugged into the Extension category as the code run is itself in an
> > extension.  I am not arguing against the point that it may be
> > interesting to plug in custom wait event categories, but the current
> > design of wait events makes that much harder than what core is
> > currently able to handle, and I am not sure that this brings much at
> > the end as long as the wait event strings can be customized.
> >
> > I've voiced upthread concerns over the naming enforced by the patch
> > and the way it plugs the namings into the isolation functions, by the
> > way.
> 
> I think the core code should provide an "Injection Point" wait event
> type and let extensions add specific wait events there, just like you
> did for "Extension".

Michael, could you accept the core code offering that, or not?  If so, I am
content to implement that.  If not, for injection point wait events, I have
just one priority.  The isolation tester already detects lmgr locks without
the test writer teaching it about each lock individually.  I want it to have
that same capability for injection points.  Do you think we can find something
everyone can accept, having that property?  These wait events show up in tests
only, and I'm happy to make the cosmetics be anything compatible with that
detection ability.
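
(For context, the detection in question happens via the function the
isolation tester already polls to decide whether a session is blocked;
the PIDs below are placeholders:

    SELECT pg_catalog.pg_isolation_test_session_is_blocked(12345, '{23456}');

The goal is for an injection-point wait to make that return true, the way a
heavyweight-lock wait already does, without the spec file having to name
individual wait events.)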



Re: race condition in pg_class

From
Noah Misch
Date:
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> On Wed, Jun 5, 2024 at 2:17 PM Noah Misch <noah@leadboat.com> wrote:
> > Starting 2024-06-10, I plan to push the first seven of the ten patches:
> >
> > inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
> > inplace010-tests-v1.patch
> > inplace040-waitfuncs-v1.patch
> > inplace050-tests-inj-v1.patch
> > inplace060-nodeModifyTable-comments-v1.patch
> >   Those five just deal in tests, test infrastructure, and comments.
> > inplace070-rel-locks-missing-v1.patch
> >   Main risk is new DDL deadlocks.
> > inplace080-catcache-detoast-inplace-stale-v1.patch
> >   If it fails to fix the bug it targets, I expect it's a no-op rather than
> >   breaking things.
> >
> > I'll leave the last three of the ten needing review.  Those three are beyond
> > my skill to self-certify.
> 
> It's not this patch set's fault, but I'm not very pleased to see that
> the injection point wait events have been shoehorned into the
> "Extension" category

I've replied on that branch of the thread.

> I think that the comments and commit messages in this patch set could,
> in some places, use improvement. For instance,
> inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of
> comments, which makes it hard to see what actually changed, and the
> commit message doesn't tell you, either. A good bit of it seems to be
> changing "a view" to "a view INSTEAD OF trigger" or "a view having an
> INSTEAD OF trigger," but the reasoning behind that change is not
> spelled out anywhere. The reader is left to guess what the other case
> is and why the same principles don't apply to it. I don't doubt that
> the new comments are more correct than the old ones, but I expect
> future patch authors to have difficulty maintaining that state of
> affairs.

The two kinds are trigger-updatable views and auto-updatable views.  I've
added sentences about that to the nodeModifyTable.c header comment.  One could
argue for dropping the INSTEAD OF comment changes outside of the header.

> Similarly, inplace070-rel-locks-missing-v1.patch adds no comments.
> IMHO, the commit message also isn't very informative. It disclaims
> knowledge of what bug it's fixing, while at the same time leaving the
> reader to figure out for themselves how the behavior has changed.
> Consequently, I expect writing the release notes for a release
> including this patch to be difficult: "We added some locks that block
> ... something ... in some circumstances ... to prevent ... something."
> It's not really the job of the release note author to fill in those
> blanks, but rather of the patch author or committer. I don't want to

I had been thinking release notes should just say "Add missing DDL lock
acquisitions".  One can cure a breach of our locking standards without proving
some specific bad outcome.  However, one could counter that commands like
GRANT follow a different standard, and perhaps SetRelationHasSubclass() should
use the GRANT standard.  Hence, I researched the bugs this fixes and split
inplace070-rel-locks-missing into three patches:

1. [inplace065-lock-SequenceChangePersistence] Lock in
   SequenceChangePersistence(), where the omission can lose nextval()
   increments of the sequence.

2. [inplace071-lock-SetRelationHasSubclass] Lock in SetRelationHasSubclass().
   This one has only minor benefits; see the new commit message.  A fair
   alternative would be tuple-level locking in inplace120-locktag, like that
   patch adds to GRANT.  That might avoid some deadlocks.  I feel like the
   minor benefits justify the way I chose, but it's a weak preference.

3. [inplace075-lock-heap_create] Add to heap creation:

> overburden the act of fixing bugs, but I just feel like more
> explanation is needed here. When I see for example that we're adding a
> lock acquisition to the end of heap_create(), I can't help but wonder
> if it's really true that we don't take a lock on a just-created
> relation today. I'm certainly under the impression that we lock
> newly-created, uncommitted relations, and a quick test seems to
> confirm that. I don't quite know whether that happens, but evidently
> this call is guarding against something more subtle than a categorical
> failure to lock a relation on creation so I think there should be a
> comment explaining what that thing is.

I've covered that in the new log message.  To lock as early as possible, I've
moved this up a layer, to just after relid assignment.  One could argue this
change belongs in inplace120 rather than its own patch, since it's only here
to eliminate a harmless exception to the rule inplace120 asserts.

I've removed the update_relispartition() that appeared in
inplace070-rel-locks-missing-v1.patch.  Only an older, unpublished draft of
the rules (that inplace110-successors adds to README.tuplock) required that
lock.  The lock might be worthwhile for avoiding "tuple concurrently updated",
but it's out of scope for $SUBJECT.

> It's also quite surprising that SetRelationHasSubclass() says "take X
> lock before calling" and 2 of 4 callers just don't. I guess that's how
> it is. But shouldn't we then have an assertion inside that function to
> guard against future mistakes? If the reason why we failed to add this

Works for me.  Done.  I've moved the LockHeldByMe() change from
inplace110-successors to this patch, since the assertion wants it.
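
For anyone following along, this is roughly the shape of the guard in
question; the exact lock mode, helper, and placement are a sketch, not
necessarily the committed form:

void
SetRelationHasSubclass(Oid relationId, bool relhassubclass)
{
    LOCKTAG     tag;

    /* caller must already hold a sufficient lock on the relation */
    SET_LOCKTAG_RELATION(tag, MyDatabaseId, relationId);
    Assert(LockHeldByMe(&tag, ShareUpdateExclusiveLock));

    /* ... existing pg_class lookup and update ... */
}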

> initially is discernible from the commit messages that introduced the
> bug, it would be nice to mention what it seems to have been; if not,
> it would at least be nice to mention the offending commit(s). I'm also

Done.

> a bit worried that this is going to cause deadlocks, but I suppose if
> it does, that's still better than the status quo.
> 
> IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation
> instead of IsInplaceUpdateOid.

Fixed.

> inplace080-catcache-detoast-inplace-stale-v1.patch seems like another
> place where spelling out the rationale in more detail would be helpful
> to future readers; for instance, the commit message says that
> PgDatabaseToastTable is the only one affected, but it doesn't say why
> the others are not, or why this one is. The lengthy comment in

I've updated the commit message to answer that.

> CatalogCacheCreateEntry is also difficult to correlate with the code
> which follows. I can't guess whether the two cases called out in the
> comment always needed to be handled and were handled save only for
> in-place updates, and thus the comment changes were simply taking the
> opportunity to elaborate on the existing comments; or whether one of
> those cases is preexisting and the other arises from the desire to
> handle inplace updates. It could be helpful to mention relevant
> identifiers from the code in the comment text e.g.
> "systable_recheck_tuple detects ordinary updates by noting changes to
> the tuple's visibility information, while the equalTuple() case
> detects inplace updates."

The patch was elaborating on existing comments.  Reading the patch again
today, the elaboration no longer feels warranted.  Hence, I've rewritten that
comment addition.  I've included identifiers, and the patch no longer adds
comment material orthogonal to inplace updates.

Thanks,
nm

Attachment

Re: race condition in pg_class

From
Michael Paquier
Date:
On Mon, Jun 10, 2024 at 07:19:27PM -0700, Noah Misch wrote:
> On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
>> I think the core code should provide an "Injection Point" wait event
>> type and let extensions add specific wait events there, just like you
>> did for "Extension".
>
> Michael, could you accept the core code offering that, or not?  If so, I am
> content to implement that.  If not, for injection point wait events, I have
> just one priority.  The isolation tester already detects lmgr locks without
> the test writer teaching it about each lock individually.  I want it to have
> that same capability for injection points. Do you think we can find something
> everyone can accept, having that property?  These wait events show up in tests
> only, and I'm happy to make the cosmetics be anything compatible with that
> detection ability.

Adding a wait event class for injection point is an interesting
suggestion that would simplify the detection in the isolation function
quite a bit.  Are you sure that this is something that would be fit
for v17 material?  TBH, I am not sure.

In the end, the test coverage has the highest priority and the bugs
you are addressing are complex enough that isolation tests of this
level are a necessity, so I don't object to what
inplace050-tests-inj-v2.patch introduces with the naming dependency
for the time being on HEAD.  I'll just adapt and live with that
depending on what I deal with, while trying to improve HEAD later on.

I'm still wondering if there is something that could be more elegant
than a dedicated class for injection points, but I cannot think of
anything better for isolation tests off the top of my head.
If there is something I can think of, I'll just go and implement it :)
--
Michael

Attachment

Re: race condition in pg_class

From
Michail Nikolaev
Date:
Hello, everyone.

I am not sure, but I think that issue may be related to the issue described in https://www.postgresql.org/message-id/CANtu0ojXmqjmEzp-%3DaJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg%40mail.gmail.com

It looks like REINDEX CONCURRENTLY could interfere with ON CONFLICT UPDATE in some strange way.

Best regards,
Mikhail.

Re: race condition in pg_class

From
Noah Misch
Date:
On Tue, Jun 11, 2024 at 01:37:21PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 07:19:27PM -0700, Noah Misch wrote:
> > On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
> >> I think the core code should provide an "Injection Point" wait event
> >> type and let extensions add specific wait events there, just like you
> >> did for "Extension".
> > 
> > Michael, could you accept the core code offering that, or not?  If so, I am
> > content to implement that.  If not, for injection point wait events, I have
> > just one priority.  The isolation tester already detects lmgr locks without
> > the test writer teaching it about each lock individually.  I want it to have
> > that same capability for injection points. Do you think we can find something
> > everyone can accept, having that property?  These wait events show up in tests
> > only, and I'm happy to make the cosmetics be anything compatible with that
> > detection ability.
> 
> Adding a wait event class for injection point is an interesting
> suggestion that would simplify the detection in the isolation function
> quite a bit.  Are you sure that this is something that would be fit
> for v17 material?  TBH, I am not sure.

If I were making a list of changes always welcome post-beta, it wouldn't
include adding wait event types.  But I don't hesitate to add one if it
unblocks a necessary test for a bug present in all versions.

> At the end, the test coverage has the highest priority and the bugs
> you are addressing are complex enough that isolation tests of this
> level are a necessity, so I don't object to what
> inplace050-tests-inj-v2.patch introduces with the naming dependency
> for the time being on HEAD.  I'll just adapt and live with that
> depending on what I deal with, while trying to improve HEAD later on.

Here's what I'm reading for each person's willingness to tolerate each option:

STRATEGY                        | Paquier | Misch | Haas
--------------------------------------------------------
new "Injection Point" wait type | maybe   | yes   | yes
INJECTION_POINT(...) naming     | yes     | yes   | unknown
isolation spec says event names | yes     | no    | unknown

Corrections and additional strategy lines welcome.  Robert, how do you judge
the lines where I've listed you as "unknown"?

> I'm still wondering if there is something that could be more elegant
> than a dedicated class for injection points, but I cannot think about
> something that would be better for isolation tests on top of my head.
> If there is something I can think of, I'll just go and implement it :)

I once considered changing them to use advisory lock waits instead of
ConditionVariableSleep(), but I recall that was worse from the perspective of
injection points in critical sections.



Re: race condition in pg_class

From
Robert Haas
Date:
On Wed, Jun 12, 2024 at 1:54 PM Noah Misch <noah@leadboat.com> wrote:
> If I were making a list of changes always welcome post-beta, it wouldn't
> include adding wait event types.  But I don't hesitate to add one if it
> unblocks a necessary test for a bug present in all versions.

However, injection points themselves are not present in all versions,
so even if we invent a new wait-event type, we'll have difficulty
testing older versions, unless we're planning to back-patch all of
that infrastructure, which I assume we aren't.

Personally, I think the fact that injection point wait events were put
under Extension is a design mistake that should be corrected before 17
is out of beta.

> Here's what I'm reading for each person's willingness to tolerate each option:
>
> STRATEGY                        | Paquier | Misch | Haas
> --------------------------------------------------------
> new "Injection Point" wait type | maybe   | yes   | yes
> INJECTION_POINT(...) naming     | yes     | yes   | unknown
> isolation spec says event names | yes     | no    | unknown
>
> Corrections and additional strategy lines welcome.  Robert, how do you judge
> the lines where I've listed you as "unknown"?

I'd tolerate INJECTION_POINT() if we had no other option but I think
it's clearly inferior. Does the last line refer to putting the
specific wait event names in the isolation spec file? If so, I'd also
be fine with that.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: race condition in pg_class

From
Noah Misch
Date:
On Wed, Jun 12, 2024 at 02:08:31PM -0400, Robert Haas wrote:
> On Wed, Jun 12, 2024 at 1:54 PM Noah Misch <noah@leadboat.com> wrote:
> > If I were making a list of changes always welcome post-beta, it wouldn't
> > include adding wait event types.  But I don't hesitate to add one if it
> > unblocks a necessary test for a bug present in all versions.
> 
> However, injection points themselves are not present in all versions,
> so even if we invent a new wait-event type, we'll have difficulty
> testing older versions, unless we're planning to back-patch all of
> that infrastructure, which I assume we aren't.

Right.  We could put the injection point tests in v18 only instead of v17+v18.
I feel that would be an overreaction to a dispute about names that show up
only in tests.  Even so, I could accept that.

> Personally, I think the fact that injection point wait events were put
> under Extension is a design mistake that should be corrected before 17
> is out of beta.

Works for me.  I don't personally have a problem with the use of Extension,
since it is a src/test/modules extension creating them.

> > Here's what I'm reading for each person's willingness to tolerate each option:
> >
> > STRATEGY                        | Paquier | Misch | Haas
> > --------------------------------------------------------
> > new "Injection Point" wait type | maybe   | yes   | yes
> > INJECTION_POINT(...) naming     | yes     | yes   | unknown
> > isolation spec says event names | yes     | no    | unknown
> >
> > Corrections and additional strategy lines welcome.  Robert, how do you judge
> > the lines where I've listed you as "unknown"?
> 
> I'd tolerate INJECTION_POINT() if we had no other option but I think
> it's clearly inferior. Does the last line refer to putting the
> specific wait event names in the isolation spec file? If so, I'd also
> be fine with that.

Yes, the last line does refer to that.  Updated table:

STRATEGY                        | Paquier | Misch | Haas
--------------------------------------------------------
new "Injection Point" wait type | maybe   | yes   | yes
INJECTION_POINT(...) naming     | yes     | yes   | no
isolation spec says event names | yes     | no    | yes

I find that's adequate support for the first line.  If there are no objections
in the next 24hr, I will implement that.



Re: race condition in pg_class

From
Noah Misch
Date:
On Wed, Jun 12, 2024 at 03:02:43PM +0200, Michail Nikolaev wrote:
> I am not sure, but I think that issue may be related to the issue described
> in
> https://www.postgresql.org/message-id/CANtu0ojXmqjmEzp-%3DaJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg%40mail.gmail.com
> 
> It looks like REINDEX CONCURRENTLY could interfere with ON CONFLICT UPDATE
> in some strange way.

Can you say more about the connection you see between $SUBJECT and that?  That
looks like a valid report of an important bug, but I'm not following the
potential relationship to $SUBJECT.

On your other thread, it would be useful to see stack traces from the high-CPU
processes once the live lock has ended all query completion.



Re: race condition in pg_class

From
Michail Nikolaev
Date:
Hello!

> Can you say more about the connection you see between $SUBJECT and that?  That
> looks like a valid report of an important bug, but I'm not following the
> potential relationship to $SUBJECT.

I was guided by the following logic:
* A pg_class race condition can cause table indexes to look stale.
* REINDEX updates indexes.
* Errors can be explained by different backends using different arbiter indexes.

> On your other thread, it would be useful to see stack traces from the high-CPU
> processes once the live lock has ended all query completion.
Will do.

Best regards,
Mikhail.

Re: race condition in pg_class

From
Michael Paquier
Date:
On Wed, Jun 12, 2024 at 12:32:23PM -0700, Noah Misch wrote:
> On Wed, Jun 12, 2024 at 02:08:31PM -0400, Robert Haas wrote:
>> Personally, I think the fact that injection point wait events were put
>> under Extension is a design mistake that should be corrected before 17
>> is out of beta.

Well, isolation tests and the way to wait for specific points in them
is something I thought about when working on the initial injection
point infrastructure, but all my ideas came down to the fact that this
is not specific to injection points: I've also wanted to be able to
make an isolation test wait for a specific event (class,name).  A
hardcoded sleep is an example.  Even if I discourage anything like
that in the in-core tests, because they're slow on fast machines and
can be unreliable on slow machines, it is a fact that they are used by
out-of-core code and that extension developers find them acceptable.

> Works for me.  I don't personally have a problem with the use of Extension,
> since it is a src/test/modules extension creating them.

That's the original reason Extension has been used in this case: the
points are assigned in an extension.

> Yes, the last line does refer to that.  Updated table:
>
> STRATEGY                        | Paquier | Misch | Haas
> --------------------------------------------------------
> new "Injection Point" wait type | maybe   | yes   | yes
> INJECTION_POINT(...) naming     | yes     | yes   | no
> isolation spec says event names | yes     | no    | yes
>
> I find that's adequate support for the first line.  If there are no objections
> in the next 24hr, I will implement that.

OK.  That sounds like a consensus to me, useful enough for the cases
at hand.
--
Michael

Attachment

Re: race condition in pg_class

From
Noah Misch
Date:
On Wed, Jun 12, 2024 at 10:02:00PM +0200, Michail Nikolaev wrote:
> > Can you say more about the connection you see between $SUBJECT and that?  That
> > looks like a valid report of an important bug, but I'm not following the
> > potential relationship to $SUBJECT.
> 
> I was guided by the following logic:
> * A pg_class race condition can cause table indexes to look stale.
> * REINDEX updates indexes
> * errors can be explained by different backends using different arbiter
> indexes

Got it.  The race condition of $SUBJECT involves inplace updates, and the
wrong content becomes permanent.  Hence, I suspect they're unrelated.



Re: race condition in pg_class

From
Noah Misch
Date:
On Mon, Jun 10, 2024 at 07:45:25PM -0700, Noah Misch wrote:
> On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > It's not this patch set's fault, but I'm not very pleased to see that
> > the injection point wait events have been shoehorned into the
> > "Extension" category
> 
> I've replied on that branch of the thread.

I think the attached covers all comments to date.  I gave everything v3, but
most patches have just a no-conflict rebase vs. v2.  The exceptions are
inplace031-inj-wait-event (implements the holding from that branch of the
thread) and inplace050-tests-inj (updated to cooperate with inplace031).  Much
of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
infrastructure common to the two custom wait event types.

Thanks,
nm

Attachment

Re: race condition in pg_class

From
Michael Paquier
Date:
On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> I think the attached covers all comments to date.  I gave everything v3, but
> most patches have just a no-conflict rebase vs. v2.  The exceptions are
> inplace031-inj-wait-event (implements the holding from that branch of the
> thread) and inplace050-tests-inj (updated to cooperate with inplace031).  Much
> of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> infrastructure common to the two custom wait event types.

Looking at inplace031-inj-wait-event..

The comment at the top of GetWaitEventCustomNames() requires an
update, still mentioning extensions.

GetWaitEventCustomIdentifier() is incorrect, and should return
"InjectionPoint" in the default case of this class name, no?  I would
just pass the classID to GetWaitEventCustomIdentifier().

It is suboptimal to have pg_get_wait_events() do two scans of
WaitEventCustomHashByName.  Wouldn't it be better to do a single scan,
returning a set of (class_name,event_name) fed to the tuplestore of
this SRF?

 uint32
 WaitEventExtensionNew(const char *wait_event_name)
 {
+    return WaitEventCustomNew(PG_WAIT_EXTENSION, wait_event_name);
+}
+
+uint32
+WaitEventInjectionPointNew(const char *wait_event_name)
+{
+    return WaitEventCustomNew(PG_WAIT_INJECTIONPOINT, wait_event_name);
+}

Hmm.  The advantage of two routines is that it is possible to control
the class IDs allowed to use the custom wait events.  Shouldn't the
second routine be documented in xfunc.sgml?

wait_event_names.txt also needs tweaks, in the shape of a new class
name for the new class "InjectionPoint" so that it can be documented
for its default case.  That's a fallback if an event ID cannot be
found, which should not happen, but it's still more correct than
showing "Extension" for all class IDs covered by custom wait events.
--
Michael

Attachment

Re: race condition in pg_class

From
Noah Misch
Date:
On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote:
> Looking at inplace031-inj-wait-event..
> 
> The comment at the top of GetWaitEventCustomNames() requires an
> update, still mentioning extensions.

Thanks.  Fixed locally.

> GetWaitEventCustomIdentifier() is incorrect, and should return
> "InjectionPoint" in the default case of this class name, no?

I intentionally didn't provide a default event ID for InjectionPoint.
PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing
else.  For this second custom type, it's needless complexity.  The value
0x0B000000U won't just show up like PG_WAIT_EXTENSION does.
GetLWLockIdentifier() also has no default case.  How do you see it?

> I would
> just pass the classID to GetWaitEventCustomIdentifier().

As you say, that would allow eventId==0 to raise "could not find custom wait
event" for PG_WAIT_INJECTIONPOINT instead of wrongly returning "Extension".
Even if 0x0B000000U somehow does show up, having pg_stat_activity report
"Extension" instead of an error, in a developer test run, feels unimportant to
me.

> It is suboptimal to have pg_get_wait_events() do two scans of
> WaitEventCustomHashByName.  Wouldn't it be better to do a single scan,
> returning a set of (class_name,event_name) fed to the tuplestore of
> this SRF?

Micro-optimization of pg_get_wait_events() doesn't matter.  I did consider
that or pushing more of the responsibility into wait_events.c, but I
considered it on code repetition grounds, not performance grounds.

>  uint32
>  WaitEventExtensionNew(const char *wait_event_name)
>  {
> +    return WaitEventCustomNew(PG_WAIT_EXTENSION, wait_event_name);
> +}
> +
> +uint32
> +WaitEventInjectionPointNew(const char *wait_event_name)
> +{
> +    return WaitEventCustomNew(PG_WAIT_INJECTIONPOINT, wait_event_name);
> +}
> 
> Hmm.  The advantage of two routines is that it is possible to control
> the class IDs allowed to use the custom wait events.  Shouldn't the
> second routine be documented in xfunc.sgml?

The patch added to xfunc.sgml an example of using it.  I'd be more inclined to
delete the WaitEventExtensionNew() docbook documentation than to add its level
of detail for WaitEventInjectionPointNew().  We don't have that kind of
documentation for most extension-facing C functions.



Re: race condition in pg_class

From
Michael Paquier
Date:
On Thu, Jun 13, 2024 at 07:42:25PM -0700, Noah Misch wrote:
> On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote:
>> GetWaitEventCustomIdentifier() is incorrect, and should return
>> "InjectionPoint" in the default case of this class name, no?
>
> I intentionally didn't provide a default event ID for InjectionPoint.
> PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing
> else.  For this second custom type, it's needless complexity.  The value
> 0x0B000000U won't just show up like PG_WAIT_EXTENSION does.
> GetLWLockIdentifier() also has no default case.  How do you see it?

I would add a default for consistency, as this is just a few extra
lines, but if you feel strongly about that, I'm OK as well.  It would
make it a bit easier to detect wait event numbers set incorrectly in
extensions, depending on the class wanted.

> The patch added to xfunc.sgml an example of using it.  I'd be more inclined to
> delete the WaitEventExtensionNew() docbook documentation than to add its level
> of detail for WaitEventInjectionPointNew().  We don't have that kind of
> documentation for most extension-facing C functions.

It's one of the areas where I think that we should have more
documentation, not less of it, so I'd rather keep it, and maintaining
it is not really a pain (?).  The backend gets complicated enough
these days that limiting what developers have to guess on their own is
a better long-term approach, because the Postgres out-of-core
ecosystem is expanding a lot (i.e., also have in-core documentation
for hooks, even if there's been a lot of reluctance historically about
having them).
--
Michael

Attachment

Re: race condition in pg_class

From
Noah Misch
Date:
On Sun, Jun 16, 2024 at 09:28:05AM +0900, Michael Paquier wrote:
> On Thu, Jun 13, 2024 at 07:42:25PM -0700, Noah Misch wrote:
> > On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote:
> >> GetWaitEventCustomIdentifier() is incorrect, and should return
> >> "InjectionPoint" in the default case of this class name, no?
> > 
> > I intentionally didn't provide a default event ID for InjectionPoint.
> > PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing
> > else.  For this second custom type, it's needless complexity.  The value
> > 0x0B000000U won't just show up like PG_WAIT_EXTENSION does.
> > GetLWLockIdentifier() also has no default case.  How do you see it?
> 
> I would add a default for consistency as this is just a few extra
> lines, but if you feel strongly about that, I'm OK as well.  It makes
> a bit easier the detection of incorrect wait event numbers set
> incorrectly in extensions depending on the class wanted.

It would be odd to detect exactly 0x0B000000U and not other invalid inputs,
like 0x0A000001U where only 0x0B000001U is valid.  I'm attaching roughly what
it would take.  Shall I squash this into inplace031?

The thing I feel strongly about here is keeping focus on fixing $SUBJECT bugs
that are actually corrupting data out there.  I think we should all limit our
interest in the verbiage of strings that appear only when running developer
tests, especially when $SUBJECT is a bug fix.  When the string appears only
after C code passes invalid input to other C code, it matters even less.

> > The patch added to xfunc.sgml an example of using it.  I'd be more inclined to
> > delete the WaitEventExtensionNew() docbook documentation than to add its level
> > of detail for WaitEventInjectionPointNew().  We don't have that kind of
> > documentation for most extension-facing C functions.
> 
> It's one of the areas where I think that we should have more
> documentation, not less of it, so I'd rather keep it and maintaining
> it is not really a pain (?).  The backend gets complicated enough
> these days that limiting what developers have to guess on their own is
> a better long-term approach because the Postgres out-of-core ecosystem
> is expanding a lot (aka have also in-core documentation for hooks,
> even if there's been a lot of reluctance historically about having
> them).

[getting deeply off topic -- let's move this to another thread if it needs to
expand] I like reducing the need to guess.  So far in this inplace update
project (this thread plus postgr.es/m/20240615223718.42.nmisch@google.com),
three patches just fix comments.  Even comments carry quite a price, but I
value them.  When we hand-maintain documentation of a C function in both its
header comment and another place, I get skeptical about whether hackers
(including myself) will actually keep them in sync and skeptical of the
incremental value of maintaining the second version.

Attachment

Re: race condition in pg_class

From
Michael Paquier
Date:
On Sun, Jun 16, 2024 at 07:07:08AM -0700, Noah Misch wrote:
> It would be odd to detect exactly 0x0B000000U and not other invalid inputs,
> like 0x0A000001U where only 0x0B000001U is valid.  I'm attaching roughly what
> it would take.  Shall I squash this into inplace031?

Agreed that merging both together is cleaner.  Moving the event class
into the key of WaitEventCustomEntryByInfo leads to a more consistent
final result.

> The thing I feel strongly about here is keeping focus on fixing $SUBJECT bugs
> that are actually corrupting data out there.

Agreed to focus on that first.
--
Michael

Attachment

Re: race condition in pg_class

From
Noah Misch
Date:
On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> On Mon, Jun 10, 2024 at 07:45:25PM -0700, Noah Misch wrote:
> > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > > It's not this patch set's fault, but I'm not very pleased to see that
> > > the injection point wait events have been shoehorned into the
> > > "Extension" category
> > 
> > I've replied on that branch of the thread.
> 
> I think the attached covers all comments to date.  I gave everything v3, but
> most patches have just a no-conflict rebase vs. v2.  The exceptions are
> inplace031-inj-wait-event (implements the holding from that branch of the
> thread) and inplace050-tests-inj (updated to cooperate with inplace031).  Much
> of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> infrastructure common to the two custom wait event types.

Starting 2024-06-27, I'd like to push
inplace080-catcache-detoast-inplace-stale and earlier patches, self-certifying
them if needed.  Then I'll submit the last three to the commitfest.  Does
anyone want me to delay that step?

Two more test-related changes compared to v3:

- In inplace010-tests, add to 027_stream_regress.pl a test that catalog
  contents match between primary and standby.  If one of these patches broke
  replay of inplace updates, this would help catch it.

- In inplace031-inj-wait-event, make sysviews.sql indifferent to whether
  InjectionPoint wait events exist.  installcheck needs this if other activity
  has created such an event since the last postmaster restart.

Attachment

Re: race condition in pg_class

From
Noah Misch
Date:
On Fri, Jun 21, 2024 at 02:28:42PM -0700, Noah Misch wrote:
> On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> > I think the attached covers all comments to date.  I gave everything v3, but
> > most patches have just a no-conflict rebase vs. v2.  The exceptions are
> > inplace031-inj-wait-event (implements the holding from that branch of the
> > thread) and inplace050-tests-inj (updated to cooperate with inplace031).  Much
> > of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> > infrastructure common to the two custom wait event types.
> 
> Starting 2024-06-27, I'd like to push
> inplace080-catcache-detoast-inplace-stale and earlier patches, self-certifying
> them if needed.  Then I'll submit the last three to the commitfest.  Does
> anyone want me to delay that step?

Pushed.  Buildfarm member prion is failing the new inplace-inval.spec, almost
surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is
testing an extant failure to inval a cache entry.  Naturally, inexorable inval
masks the extant bug.  Ideally, I'd just skip the test under any kind of cache
clobber option.  I don't know a pleasant way to do that, so these are
known-feasible things I'm considering:

1. Neutralize the test in all branches, probably by having it just not report
   the final answer.  Undo in the later fix patch.

2. v14+ has pg_backend_memory_contexts.  In the test, run some plpgsql that
   uses heuristics on that to deduce whether caches are getting released.
   Have a separate expected output for the cache-release scenario.  Perhaps
   also have the test treat installcheck like cache-release, since
   installcheck could experience sinval reset with similar consequences.
   Neutralize the test in v12 & v13.

3. Add a test module with a C function that reports whether any kind of cache
   clobber is active.  Call it in this test.  Have a separate expected output
   for the cache-release scenario.

Preferences or other ideas?  I'm waffling between (1) and (2).  I'll give it
more thought over the next day.
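
For (2), the heuristic would be something along these lines, run before and
after a catalog access, with a separate expected file keyed off the observed
behavior (illustrative only; whether this reliably distinguishes
cache-release builds is exactly the open question):

    SELECT sum(used_bytes) AS catcache_bytes
    FROM pg_backend_memory_contexts
    WHERE name = 'CacheMemoryContext';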

Thanks,
nm



Re: race condition in pg_class

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> Pushed.  Buildfarm member prion is failing the new inplace-inval.spec, almost
> surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is
> testing an extant failure to inval a cache entry.  Naturally, inexorable inval
> masks the extant bug.  Ideally, I'd just skip the test under any kind of cache
> clobber option.  I don't know a pleasant way to do that, so these are
> known-feasible things I'm considering:

> 1. Neutralize the test in all branches, probably by having it just not report
>    the final answer.  Undo in the later fix patch.

> 2. v14+ has pg_backend_memory_contexts.  In the test, run some plpgsql that
>    uses heuristics on that to deduce whether caches are getting released.
>    Have a separate expected output for the cache-release scenario.  Perhaps
>    also have the test treat installcheck like cache-release, since
>    installcheck could experience sinval reset with similar consequences.
>    Neutralize the test in v12 & v13.

> 3. Add a test module with a C function that reports whether any kind of cache
>    clobber is active.  Call it in this test.  Have a separate expected output
>    for the cache-release scenario.

> Preferences or other ideas?  I'm waffling between (1) and (2).  I'll give it
> more thought over the next day.

I'd just go for (1).  We were doing fine without this test case.
I can't see expending effort towards hiding its result rather
than actually fixing anything.

            regards, tom lane



Re: race condition in pg_class

From
Noah Misch
Date:
On Fri, Jun 28, 2024 at 01:17:22AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Pushed.  Buildfarm member prion is failing the new inplace-inval.spec, almost
> > surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is
> > testing an extant failure to inval a cache entry.  Naturally, inexorable inval
> > masks the extant bug.  Ideally, I'd just skip the test under any kind of cache
> > clobber option.  I don't know a pleasant way to do that, so these are
> > known-feasible things I'm considering:
> 
> > 1. Neutralize the test in all branches, probably by having it just not report
> >    the final answer.  Undo in the later fix patch.
> 
> > 2. v14+ has pg_backend_memory_contexts.  In the test, run some plpgsql that
> >    uses heuristics on that to deduce whether caches are getting released.
> >    Have a separate expected output for the cache-release scenario.  Perhaps
> >    also have the test treat installcheck like cache-release, since
> >    installcheck could experience sinval reset with similar consequences.
> >    Neutralize the test in v12 & v13.
> 
> > 3. Add a test module with a C function that reports whether any kind of cache
> >    clobber is active.  Call it in this test.  Have a separate expected output
> >    for the cache-release scenario.
> 
> > Preferences or other ideas?  I'm waffling between (1) and (2).  I'll give it
> > more thought over the next day.
> 
> I'd just go for (1).  We were doing fine without this test case.
> I can't see expending effort towards hiding its result rather
> than actually fixing anything.

Good point, any effort on (2) would be wasted once the fixes get certified.  I
pushed (1).  I'm attaching the rebased fix patches.

Attachment

Re: race condition in pg_class

From
Alexander Lakhin
Date:
Hello Noah,

29.06.2024 05:42, Noah Misch wrote:
> Good point, any effort on (2) would be wasted once the fixes get certified.  I
> pushed (1).  I'm attaching the rebased fix patches.

Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
CREATE TABLE t (i int) PARTITION BY LIST(i);
CREATE TABLE p1 (i int);
ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
ALTER TABLE t DETACH PARTITION p1;
ANALYZE t;

triggers unexpected
ERROR:  tuple to be updated was already modified by an operation triggered by the current command

Best regards,
Alexander



Re: race condition in pg_class

From
Noah Misch
Date:
On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote:
> 29.06.2024 05:42, Noah Misch wrote:
> > Good point, any effort on (2) would be wasted once the fixes get certified.  I
> > pushed (1).  I'm attaching the rebased fix patches.
> 
> Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
> CREATE TABLE t (i int) PARTITION BY LIST(i);
> CREATE TABLE p1 (i int);
> ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
> ALTER TABLE t DETACH PARTITION p1;
> ANALYZE t;
> 
> triggers unexpected
> ERROR:  tuple to be updated was already modified by an operation triggered by the current command

Thanks.  Today, it's okay to issue heap_inplace_update() after heap_update()
without an intervening CommandCounterIncrement().  The patch makes the CCI
required.  The ANALYZE in your example reaches this with a heap_update to set
relhassubclass=f.  I've fixed this by just adding a CCI (and adding to the
tests in vacuum.sql).
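
In code terms, the shape of the fix is roughly the following; where exactly
the CCI lands is a sketch here, not the committed hunk:

    /* clears relhassubclass via a regular (transactional) heap_update */
    SetRelationHasSubclass(RelationGetRelid(rel), false);

    /*
     * Make that change visible to the current command before the
     * in-place update of the same pg_class row that follows.
     */
    CommandCounterIncrement();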

The alternative would be to allow inplace updates on TM_SelfModified tuples.
I can't think of a specific problem with allowing that, but I feel that would
make system state interactions harder to reason about.  It might be optimal to
allow that in back branches only, to reduce the chance of releasing a bug like
the one you found.

Attachment

Re: race condition in pg_class

From
Noah Misch
Date:
On Wed, Jul 03, 2024 at 04:09:54PM -0700, Noah Misch wrote:
> On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote:
> > 29.06.2024 05:42, Noah Misch wrote:
> > > Good point, any effort on (2) would be wasted once the fixes get certified.  I
> > > pushed (1).  I'm attaching the rebased fix patches.
> > 
> > Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
> > CREATE TABLE t (i int) PARTITION BY LIST(i);
> > CREATE TABLE p1 (i int);
> > ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
> > ALTER TABLE t DETACH PARTITION p1;
> > ANALYZE t;
> > 
> > triggers unexpected
> > ERROR:  tuple to be updated was already modified by an operation triggered by the current command
> 
> Thanks.  Today, it's okay to issue heap_inplace_update() after heap_update()
> without an intervening CommandCounterIncrement().

Correction: it's not okay today.  If code does that, heap_inplace_update()
mutates a tuple that is going to become invisible at CCI.  The lack of CCI
yields a minor live bug in v14+.  Its consequences seem to be limited to
failing to update reltuples for a partitioned table having zero partitions.

> The patch makes the CCI
> required.  The ANALYZE in your example reaches this with a heap_update to set
> relhassubclass=f.  I've fixed this by just adding a CCI (and adding to the
> tests in vacuum.sql).

That's still the right fix, but I've separated it into its own patch and
expanded the test.  All the non-comment changes between v5 and v6 are now part
of the separate patch.

> The alternative would be to allow inplace updates on TM_SelfModified tuples.
> I can't think of a specific problem with allowing that, but I feel that would
> make system state interactions harder to reason about.  It might be optimal to
> allow that in back branches only, to reduce the chance of releasing a bug like
> the one you found.

Allowing a mutation of a TM_SelfModified tuple is bad, since that tuple is
going to become dead soon.  Mutating its successor could be okay.  Since we'd
expect such code to be unreachable, I'm not keen to carry it.  For that
scenario, I'd rather keep the error you encountered.  Other opinions?

Attachment

Re: race condition in pg_class

From
Alexander Lakhin
Date:
Hello Noah,

28.06.2024 08:13, Noah Misch wrote:
> Pushed. ...

Please also look at another anomaly I've discovered.

An Assert added with d5f788b41 may be falsified with:
CREATE TABLE t(a int PRIMARY KEY);
INSERT INTO t VALUES (1);
CREATE VIEW v AS SELECT * FROM t;

MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
   WHEN MATCHED THEN DO NOTHING
   WHEN NOT MATCHED THEN DO NOTHING;

TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", Line: 2891, PID: 1590670

Best regards,
Alexander



Re: race condition in pg_class

From
Noah Misch
Date:
On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
> 28.06.2024 08:13, Noah Misch wrote:
> > Pushed. ...
> 
> Please look also at another anomaly, I've discovered.
> 
> An Assert added with d5f788b41 may be falsified with:
> CREATE TABLE t(a int PRIMARY KEY);
> INSERT INTO t VALUES (1);
> CREATE VIEW v AS SELECT * FROM t;
> 
> MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
>   WHEN MATCHED THEN DO NOTHING
>   WHEN NOT MATCHED THEN DO NOTHING;
> 
> TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", Line: 2891, PID: 1590670

Thanks.  When all the MERGE actions are DO NOTHING, view_has_instead_trigger()
returns true, so we use the wholerow code regardless of the view's triggers or
auto-update capability.  The behavior is fine, so I'm fixing the new assertion
and comments with the new patch inplace087-merge-DO-NOTHING-v8.patch.  The closest
relevant tests processed zero rows, so they narrowly avoided witnessing this
assertion.

Attachment

Re: race condition in pg_class

From
Noah Misch
Date:
On Thu, Jul 04, 2024 at 03:08:16PM -0700, Noah Misch wrote:
> On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
> > 28.06.2024 08:13, Noah Misch wrote:
> > > Pushed. ...
> > 
> > Please look also at another anomaly, I've discovered.
> > 
> > An Assert added with d5f788b41 may be falsified with:
> > CREATE TABLE t(a int PRIMARY KEY);
> > INSERT INTO t VALUES (1);
> > CREATE VIEW v AS SELECT * FROM t;
> > 
> > MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
> >   WHEN MATCHED THEN DO NOTHING
> >   WHEN NOT MATCHED THEN DO NOTHING;
> > 
> > TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", Line: 2891, PID: 1590670
> 
> Thanks.  When all the MERGE actions are DO NOTHING, view_has_instead_trigger()
> returns true

I've pushed the two patches for your reports.  To placate cfbot, I'm attaching
the remaining patches.

Attachment

Re: race condition in pg_class

From
Alexander Lakhin
Date:
Hello Noah,

28.06.2024 08:13, Noah Misch wrote:
> Pushed.

A recent buildfarm test failure [1] showed that the
intra-grant-inplace-db.spec test added with 0844b3968 may fail
on a slow machine (per my understanding):

test intra-grant-inplace-db       ... FAILED     4302 ms

@@ -21,8 +21,7 @@
      WHERE datname = current_catalog
          AND age(datfrozenxid) > (SELECT min(age(x)) FROM frozen_witness);

-?column?
-----------------------
-datfrozenxid retreated
-(1 row)
+?column?
+--------
+(0 rows)

whilst the previous (successful) run shows a much shorter duration:
test intra-grant-inplace-db       ... ok          540 ms

I reproduced this failure on a VM slowed down so that the test duration
reached 4+ seconds, with 100 copies of "test: intra-grant-inplace-db" in
isolation_schedule:
test intra-grant-inplace-db       ... ok         4324 ms
test intra-grant-inplace-db       ... FAILED     4633 ms
test intra-grant-inplace-db       ... ok         4649 ms

But as the test is going to be modified by the inplace110-successors-v8.patch
and the modified test (with all three latest patches applied) passes
reliably in the same conditions, maybe this failure doesn't deserve a
deeper exploration.

What do you think?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08

Best regards,
Alexander



Re: race condition in pg_class

From
Noah Misch
Date:
On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> 28.06.2024 08:13, Noah Misch wrote:
> > Pushed.
> 
> A recent buildfarm test failure [1] showed that the
> intra-grant-inplace-db.spec test added with 0844b3968 may fail
> on a slow machine

> But as the test going to be modified by the inplace110-successors-v8.patch
> and the modified test (with all three latest patches applied) passes
> reliably in the same conditions, maybe this failure doesn't deserve a
> deeper exploration.

Agreed.  Let's just wait for code review of the actual bug fix, not develop a
separate change to stabilize the test.  One flake in three weeks is low enough
to make that okay.

> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08



Re: race condition in pg_class

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
>> A recent buildfarm test failure [1] showed that the
>> intra-grant-inplace-db.spec test added with 0844b3968 may fail
>> on a slow machine

>> But as the test going to be modified by the inplace110-successors-v8.patch
>> and the modified test (with all three latest patches applied) passes
>> reliably in the same conditions, maybe this failure doesn't deserve a
>> deeper exploration.

> Agreed.  Let's just wait for code review of the actual bug fix, not develop a
> separate change to stabilize the test.  One flake in three weeks is low enough
> to make that okay.

It's now up to three similar failures in the past ten days: in
addition to

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08

I see

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2024-07-22%2018%3A00%3A46

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-07-28%2012%3A20%3A37

Is it time to worry yet?  If this were HEAD only, I'd not be too
concerned; but two of these three are on allegedly-stable branches.
And we have releases coming up fast.

(BTW, I don't think taipan qualifies as a slow machine.)

            regards, tom lane



Re: race condition in pg_class

From
Noah Misch
Date:
On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> >> A recent buildfarm test failure [1] showed that the
> >> intra-grant-inplace-db.spec test added with 0844b3968 may fail
> 
> >> But as the test going to be modified by the inplace110-successors-v8.patch
> >> and the modified test (with all three latest patches applied) passes
> >> reliably in the same conditions, maybe this failure doesn't deserve a
> >> deeper exploration.
> 
> > Agreed.  Let's just wait for code review of the actual bug fix, not develop a
> > separate change to stabilize the test.  One flake in three weeks is low enough
> > to make that okay.
> 
> It's now up to three similar failures in the past ten days

> Is it time to worry yet?  If this were HEAD only, I'd not be too
> concerned; but two of these three are on allegedly-stable branches.
> And we have releases coming up fast.

I don't know; neither decision feels terrible to me.  A bug fix that would
address both the data corruption causes and those buildfarm failures has been
awaiting review on this thread for 77 days.  The data corruption causes are
more problematic than 0.03% of buildfarm runs getting noise failures.  Two
wrongs don't make a right, but a commit masking that level of buildfarm noise
also feels like sending the wrong message.



Re: race condition in pg_class

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
>> Is it time to worry yet?  If this were HEAD only, I'd not be too
>> concerned; but two of these three are on allegedly-stable branches.
>> And we have releases coming up fast.

> I don't know; neither decision feels terrible to me.

Yeah, same here.  Obviously, it'd be better to spend effort on getting
the bug fix committed than to spend effort on some cosmetic
workaround.

The fact that the failure is in the isolation tests not the core
regression tests reduces my level of concern somewhat about shipping
it this way.  I think that packagers typically run the core tests
not check-world during package verification, so they won't hit this.

            regards, tom lane



Re: race condition in pg_class

From
Heikki Linnakangas
Date:
On 14/07/2024 20:48, Noah Misch wrote:
> I've pushed the two patches for your reports.  To placate cfbot, I'm attaching
> the remaining patches.

inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would
be in order; it looks pretty random as it is. Something like:

/*
  * Tuple locks are currently only held for short durations within a
  * transaction. Check that we didn't forget to release one.
  */

inplace110-successors-v8.patch: Makes sense.

The README changes would be better as part of the third patch, as this 
patch doesn't actually do any of the new locking described in the 
README, and it fixes the "inplace update updates wrong tuple" bug even 
without those tuple locks.

> + * ... [any slow preparation not requiring oldtup] ...
> + * heap_inplace_update_scan([...], &tup, &inplace_state);
> + * if (!HeapTupleIsValid(tup))
> + *    elog(ERROR, [...]);
> + * ... [buffer is exclusive-locked; mutate "tup"] ...
> + * if (dirty)
> + *    heap_inplace_update_finish(inplace_state, tup);
> + * else
> + *    heap_inplace_update_cancel(inplace_state);

I wonder if the functions should be called "systable_*" and placed in 
genam.c rather than in heapam.c. The interface looks more like the 
existing systable functions. It feels like a modularity violation for a 
function in heapam.c to take an argument like "indexId", and call back 
into systable_* functions.

>     /*----------
>      * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
>      *
>      * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
>      * ["R" is a VACUUM tbl]
>      * D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
>      * D: systable_getnext() returns pg_class tuple of tbl
>      * R: memcpy() into pg_class tuple of tbl
>      * D: raise pg_database.datfrozenxid, XLogInsert(), finish
>      * [crash]
>      * [recovery restores datfrozenxid w/o relfrozenxid]
>      */

Hmm, that's a tight race, but feels bad to leave it unfixed. One 
approach would be to modify the tuple on the buffer only after 
WAL-logging it. That way, D cannot read the updated value before it has 
been WAL logged. Just need to make sure that the change still gets 
included in the WAL record. Maybe something like:

if (RelationNeedsWAL(relation))
{
     /*
      * Make a temporary copy of the page that includes the change, in
      * case a full-page image is logged
      */
     PGAlignedBlock tmppage;

     memcpy(tmppage.data, page, BLCKSZ);

     /* copy the tuple to the temporary copy */
     memcpy(...);

     XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
     XLogInsert();
}

/* copy the tuple to the buffer */
memcpy(...);


>   pg_class heap_inplace_update_scan() callers: before the call, acquire
>   LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
>   (VACUUM), or a mode with strictly more conflicts.  If the update targets a
>   row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
>   on the table.  Locking the index rel is optional.  (This allows VACUUM to
>   overwrite per-index pg_class while holding a lock on the table alone.)  We
>   could allow weaker locks, in which case the next paragraph would simply call
>   for stronger locks for its class of commands.  heap_inplace_update_scan()
>   acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for
>   ExclusiveLock, on each tuple it overwrites.
> 
>   pg_class heap_update() callers: before copying the tuple to modify, take a
>   lock that conflicts with at least one of those from the preceding paragraph.
>   SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE.
>   After heap_update(), release any LOCKTAG_TUPLE.  Most of these callers opt
>   to acquire just the LOCKTAG_RELATION.

These rules seem complicated. Phrasing this slightly differently, if I 
understand correctly: for a heap_update() caller, it's always sufficient 
to hold LOCKTAG_TUPLE, but if you happen to hold some other lock on the 
relation that conflicts with those mentioned in the first paragraph, 
then you can skip the LOCKTAG_TUPLE lock.

Could we just stipulate that you must always hold LOCKTAG_TUPLE when you 
call heap_update() on pg_class or pg_database? That'd make the rule simple.
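
In other words, something like this around every such update (a sketch;
pg_class_rel, oldtup, and newtup are placeholders, error paths are elided,
and InplaceUpdateTupleLock is the ExclusiveLock alias mentioned above):

LockTuple(pg_class_rel, &oldtup->t_self, InplaceUpdateTupleLock);
CatalogTupleUpdate(pg_class_rel, &oldtup->t_self, newtup);
UnlockTuple(pg_class_rel, &oldtup->t_self, InplaceUpdateTupleLock);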

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: race condition in pg_class

From
Noah Misch
Date:
Thanks for reviewing.

On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
> On 14/07/2024 20:48, Noah Misch wrote:
> > I've pushed the two patches for your reports.  To placate cfbot, I'm attaching
> > the remaining patches.
> 
> inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would be in
> order, it looks pretty random as it is. Something like:
> 
> /*
>  * Tuple locks are currently only held for short durations within a
>  * transaction. Check that we didn't forget to release one.
>  */

Will add.

> inplace110-successors-v8.patch: Makes sense.
> 
> The README changes would be better as part of the third patch, as this patch
> doesn't actually do any of the new locking described in the README, and it
> fixes the "inplace update updates wrong tuple" bug even without those tuple
> locks.

That should work.  Will confirm.

> > + * ... [any slow preparation not requiring oldtup] ...
> > + * heap_inplace_update_scan([...], &tup, &inplace_state);
> > + * if (!HeapTupleIsValid(tup))
> > + *    elog(ERROR, [...]);
> > + * ... [buffer is exclusive-locked; mutate "tup"] ...
> > + * if (dirty)
> > + *    heap_inplace_update_finish(inplace_state, tup);
> > + * else
> > + *    heap_inplace_update_cancel(inplace_state);
> 
> I wonder if the functions should be called "systable_*" and placed in
> genam.c rather than in heapam.c. The interface looks more like the existing
> systable functions. It feels like a modularity violation for a function in
> heapam.c to take an argument like "indexId", and call back into systable_*
> functions.

Yes, _scan() and _cancel() especially are wrappers around systable.  Some API
options follow.  Any preference or other ideas?

==== direct s/heap_/systable_/ rename

 systable_inplace_update_scan([...], &tup, &inplace_state);
 if (!HeapTupleIsValid(tup))
    elog(ERROR, [...]);
 ... [buffer is exclusive-locked; mutate "tup"] ...
 if (dirty)
    systable_inplace_update_finish(inplace_state, tup);
 else
    systable_inplace_update_cancel(inplace_state);

==== make the first and last steps more systable-like

 systable_inplace_update_begin([...], &tup, &inplace_state);
 if (!HeapTupleIsValid(tup))
    elog(ERROR, [...]);
 ... [buffer is exclusive-locked; mutate "tup"] ...
 if (dirty)
    systable_inplace_update(inplace_state, tup);
 systable_inplace_update_end(inplace_state);

==== no systable_ wrapper for middle step, more like CatalogTupleUpdate

 systable_inplace_update_begin([...], &tup, &inplace_state);
 if (!HeapTupleIsValid(tup))
    elog(ERROR, [...]);
 ... [buffer is exclusive-locked; mutate "tup"] ...
 if (dirty)
    heap_inplace_update(relation,
                        systable_inplace_old_tuple(inplace_state),
                        tup,
                        systable_inplace_buffer(inplace_state));
 systable_inplace_update_end(inplace_state);

> >     /*----------
> >      * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
> >      *
> >      * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
> >      * ["R" is a VACUUM tbl]
> >      * D: vac_update_datfrozenid() -> systable_beginscan(pg_class)
> >      * D: systable_getnext() returns pg_class tuple of tbl
> >      * R: memcpy() into pg_class tuple of tbl
> >      * D: raise pg_database.datfrozenxid, XLogInsert(), finish
> >      * [crash]
> >      * [recovery restores datfrozenxid w/o relfrozenxid]
> >      */
> 
> Hmm, that's a tight race, but feels bad to leave it unfixed. One approach
> would be to modify the tuple on the buffer only after WAL-logging it. That
> way, D cannot read the updated value before it has been WAL logged. Just
> need to make sure that the change still gets included in the WAL record.
> Maybe something like:
> 
> if (RelationNeedsWAL(relation))
> {
>     /*
>      * Make a temporary copy of the page that includes the change, in
>      * case a full-page image is logged
>      */
>     PGAlignedBlock tmppage;
> 
>     memcpy(tmppage.data, page, BLCKSZ);
> 
>     /* copy the tuple to the temporary copy */
>     memcpy(...);
> 
>     XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
>     XLogInsert();
> }
> 
> /* copy the tuple to the buffer */
> memcpy(...);

Yes, that's the essence of
inplace180-datfrozenxid-overtakes-relfrozenxid-v1.patch from
https://postgr.es/m/flat/20240620012908.92.nmisch@google.com.

> >   pg_class heap_inplace_update_scan() callers: before the call, acquire
> >   LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
> >   (VACUUM), or a mode with strictly more conflicts.  If the update targets a
> >   row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
> >   on the table.  Locking the index rel is optional.  (This allows VACUUM to
> >   overwrite per-index pg_class while holding a lock on the table alone.)  We
> >   could allow weaker locks, in which case the next paragraph would simply call
> >   for stronger locks for its class of commands.  heap_inplace_update_scan()
> >   acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for
> >   ExclusiveLock, on each tuple it overwrites.
> > 
> >   pg_class heap_update() callers: before copying the tuple to modify, take a
> >   lock that conflicts with at least one of those from the preceding paragraph.
> >   SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE.
> >   After heap_update(), release any LOCKTAG_TUPLE.  Most of these callers opt
> >   to acquire just the LOCKTAG_RELATION.
> 
> These rules seem complicated. Phrasing this slightly differently, if I
> understand correctly: for a heap_update() caller, it's always sufficient to
> hold LOCKTAG_TUPLE, but if you happen to hold some other lock on the
> relation that conflicts with those mentioned in the first paragraph, then
> you can skip the LOCKTAG_TUPLE lock.

Yes.

> Could we just stipulate that you must always hold LOCKTAG_TUPLE when you
> call heap_update() on pg_class or pg_database? That'd make the rule simple.

We could.  That would change more code sites.  Rough estimate:

$ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l
23

If the count were 2, I'd say let's simplify the rule like you're exploring.
(I originally had a complicated rule for pg_database, but I abandoned that
when it helped few code sites.)  If it were 100, I'd say the complicated rule
is worth it.  A count of 23 makes both choices fair.

Long-term, I hope relfrozenxid gets reimplemented with storage outside
pg_class, removing the need for inplace updates.  So the additional 23 code
sites might change back at a future date.  That shouldn't be a big
consideration, though.

Another option here would be to preface that README section with a simplified
view, something like, "If a warning brought you here, take a tuple lock.  The
rest of this section is just for people needing to understand the conditions
for --enable-cassert emitting that warning."  How about that instead of
simplifying the rules?



Re: race condition in pg_class

From
Heikki Linnakangas
Date:
On 17/08/2024 07:07, Noah Misch wrote:
> On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
>> On 14/07/2024 20:48, Noah Misch wrote:
>>> + * ... [any slow preparation not requiring oldtup] ...
>>> + * heap_inplace_update_scan([...], &tup, &inplace_state);
>>> + * if (!HeapTupleIsValid(tup))
>>> + *    elog(ERROR, [...]);
>>> + * ... [buffer is exclusive-locked; mutate "tup"] ...
>>> + * if (dirty)
>>> + *    heap_inplace_update_finish(inplace_state, tup);
>>> + * else
>>> + *    heap_inplace_update_cancel(inplace_state);
>>
>> I wonder if the functions should be called "systable_*" and placed in
>> genam.c rather than in heapam.c. The interface looks more like the existing
>> systable functions. It feels like a modularity violation for a function in
>> heapam.c to take an argument like "indexId", and call back into systable_*
>> functions.
> 
> Yes, _scan() and _cancel() especially are wrappers around systable.  Some API
> options follow.  Any preference or other ideas?
> 
> ==== direct s/heap_/systable_/ rename [option 1]
> 
>   systable_inplace_update_scan([...], &tup, &inplace_state);
>   if (!HeapTupleIsValid(tup))
>     elog(ERROR, [...]);
>   ... [buffer is exclusive-locked; mutate "tup"] ...
>   if (dirty)
>     systable_inplace_update_finish(inplace_state, tup);
>   else
>     systable_inplace_update_cancel(inplace_state);
> 
> ==== make the first and last steps more systable-like [option 2]
> 
>   systable_inplace_update_begin([...], &tup, &inplace_state);
>   if (!HeapTupleIsValid(tup))
>     elog(ERROR, [...]);
>   ... [buffer is exclusive-locked; mutate "tup"] ...
>   if (dirty)
>     systable_inplace_update(inplace_state, tup);
>   systable_inplace_update_end(inplace_state);
> 
> ==== no systable_ wrapper for middle step, more like CatalogTupleUpdate [option 3]
> 
>   systable_inplace_update_begin([...], &tup, &inplace_state);
>   if (!HeapTupleIsValid(tup))
>     elog(ERROR, [...]);
>   ... [buffer is exclusive-locked; mutate "tup"] ...
>   if (dirty)
>     heap_inplace_update(relation,
>                         systable_inplace_old_tuple(inplace_state),
>                         tup,
>                         systable_inplace_buffer(inplace_state));
>   systable_inplace_update_end(inplace_state);

My order of preference is: 2, 1, 3.

>> Could we just stipulate that you must always hold LOCKTAG_TUPLE when you
>> call heap_update() on pg_class or pg_database? That'd make the rule simple.
> 
> We could.  That would change more code sites.  Rough estimate:
> 
> $ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l
> 23
> 
> If the count were 2, I'd say let's simplify the rule like you're exploring.
> (I originally had a complicated rule for pg_database, but I abandoned that
> when it helped few code sites.)  If it were 100, I'd say the complicated rule
> is worth it.  A count of 23 makes both choices fair.

Ok.

How many of those for RELKIND_INDEX vs tables? I'm thinking if we should 
always require a tuple lock on indexes, if that would make a difference.

> Long-term, I hope relfrozenxid gets reimplemented with storage outside
> pg_class, removing the need for inplace updates.  So the additional 23 code
> sites might change back at a future date.  That shouldn't be a big
> consideration, though.
> 
> Another option here would be to preface that README section with a simplified
> view, something like, "If a warning brought you here, take a tuple lock.  The
> rest of this section is just for people needing to understand the conditions
> for --enable-cassert emitting that warning."  How about that instead of
> simplifying the rules?

Works for me. Or perhaps the rules could just be explained more 
succinctly. Something like:

-----
pg_class heap_inplace_update_scan() callers: before the call, acquire a 
lock on the relation in mode ShareUpdateExclusiveLock or stricter. If 
the update targets a row of RELKIND_INDEX (but not 
RELKIND_PARTITIONED_INDEX), that lock must be on the table; locking the 
index rel is not necessary.  (This allows VACUUM to overwrite per-index 
pg_class while holding a lock on the table alone.) 
heap_inplace_update_scan() acquires and releases LOCKTAG_TUPLE in 
InplaceUpdateTupleLock, an alias for ExclusiveLock, on each tuple it 
overwrites.

pg_class heap_update() callers: before copying the tuple to modify, take 
a lock on the tuple, or a ShareUpdateExclusiveLock or stricter on the 
relation.

SearchSysCacheLocked1() is one convenient way to acquire the tuple lock. 
Most heap_update() callers already hold a suitable lock on the relation 
for other reasons, and can skip the tuple lock. If you do acquire the 
tuple lock, release it immediately after the update.


pg_database: before copying the tuple to modify, all updaters of 
pg_database rows acquire LOCKTAG_TUPLE.  (Few updaters acquire 
LOCKTAG_OBJECT on the database OID, so it wasn't worth extending that as 
a second option.)
-----

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: race condition in pg_class

From
Nitin Motiani
Date:
On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah@leadboat.com> wrote:
>
> On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > My order of preference is: 2, 1, 3.
>
> I kept tuple locking responsibility in heapam.c.  That's simpler and better
> for modularity, but it does mean we release+acquire after any xmax wait.
> Before, we avoided that if the next genam.c scan found the same TID.  (If the
> next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts
> are rare enough to justify simplifying as this version does.  I don't expect
> anyone to notice the starvation outside of tests built to show it.  (With
> previous versions, one can show it with a purpose-built test that commits
> instead of aborting, like the "001_pgbench_grant@9" test.)
>
> This move also loses the optimization of unpinning before XactLockTableWait().
> heap_update() doesn't optimize that way, so that's fine.
>
> The move ended up more like (1), though I did do
> s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I
> felt that worked better than (2) to achieve lock release before
> CacheInvalidateHeapTuple().  Alternatives that could be fine:
>
From a consistency point of view, I find it cleaner if we can have all
the heap_inplace_lock and heap_inplace_unlock in the same set of
functions. So here those would be the systable_inplace_* functions.

> - In the cancel case, call both systable_inplace_update_cancel and
>   systable_inplace_update_end.  _finish or _cancel would own unlock, while
>   _end would own systable_endscan().
>
What happens to CacheInvalidateHeapTuple() in this approach?  I think
it will still need to be brought to the genam.c layer if we are
releasing the lock in systable_inplace_update_finish.

> - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
>   tolerable now, this gets less attractive after the inplace160 patch from
>   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com
>
I skimmed through the inplace160 patch. It wasn't clear to me why this
becomes less attractive with the patch. I see there is a new
CacheInvalidateHeapTupleInPlace but that looks like it would be called
while holding the lock. And then there is an
AcceptInvalidationMessages which can perhaps be moved to the genam.c
layer too. Is the concern that one invalidation call will be in the
heapam layer and the other will be in the genam layer?

Also I have a small question from inplace120.

I see that all the places we check resultRelInfo->ri_needLockTagTuple,
we can just call
IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
big advantage of storing a separate bool field? Also there is another
write to ri_RelationDesc in CatalogOpenIndexes in
src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to
be set there also to keep it consistent with ri_RelationDesc. Please
let me know if I am missing something about the usage of the new
field.

Thanks & Regards,
Nitin Motiani
Google



Re: race condition in pg_class

From
Noah Misch
Date:
On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > My order of preference is: 2, 1, 3.
> >
> > I kept tuple locking responsibility in heapam.c.  That's simpler and better
> > for modularity, but it does mean we release+acquire after any xmax wait.
> > Before, we avoided that if the next genam.c scan found the same TID.  (If the
> > next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts
> > are rare enough to justify simplifying as this version does.  I don't expect
> > anyone to notice the starvation outside of tests built to show it.  (With
> > previous versions, one can show it with a purpose-built test that commits
> > instead of aborting, like the "001_pgbench_grant@9" test.)
> >
> > This move also loses the optimization of unpinning before XactLockTableWait().
> > heap_update() doesn't optimize that way, so that's fine.
> >
> > The move ended up more like (1), though I did do
> > s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I
> > felt that worked better than (2) to achieve lock release before
> > CacheInvalidateHeapTuple().  Alternatives that could be fine:
> >
> From a consistency point of view, I find it cleaner if we can have all
> the heap_inplace_lock and heap_inplace_unlock in the same set of
> functions. So here those would be the systable_inplace_* functions.

That will technically be the case after inplace160, and I could make it so
here by inlining heap_inplace_unlock() into its heapam.c caller.  Would that
be cleaner or less clean?

> > - In the cancel case, call both systable_inplace_update_cancel and
> >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> >   _end would own systable_endscan().
> >
> What happens to CacheInvalidateHeapTuple() in this approach?  I think
> it will still need to be brought to the genam.c layer if we are
> releasing the lock in systable_inplace_update_finish.

Cancel scenarios don't do invalidation.  (Same under other alternatives.)

> > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> >   tolerable now, this gets less attractive after the inplace160 patch from
> >   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com
> >
> I skimmed through the inplace160 patch. It wasn't clear to me why this
> becomes less attractive with the patch. I see there is a new
> CacheInvalidateHeapTupleInPlace but that looks like it would be called
> while holding the lock. And then there is an
> AcceptInvalidationMessages which can perhaps be moved to the genam.c
> layer too. Is the concern that one invalidation call will be in the
> heapam layer and the other will be in the genam layer?

That, or a critical section would start in heapam.c, then end in genam.c.
Current call tree at inplace160 v4:

genam.c:systable_inplace_update_finish
 heapam.c:heap_inplace_update
  PreInplace_Inval
  START_CRIT_SECTION
  BUFFER_LOCK_UNLOCK
  AtInplace_Inval
  END_CRIT_SECTION
  UnlockTuple
  AcceptInvalidationMessages

If we hoisted all of invalidation up to the genam.c layer, a critical section
that starts in heapam.c would end in genam.c:

genam.c:systable_inplace_update_finish
 PreInplace_Inval
 heapam.c:heap_inplace_update
  START_CRIT_SECTION
 BUFFER_LOCK_UNLOCK
 AtInplace_Inval
 END_CRIT_SECTION
 UnlockTuple
 AcceptInvalidationMessages

If we didn't accept splitting the critical section but did accept splitting
invalidation responsibilities, one gets perhaps:

genam.c:systable_inplace_update_finish
 PreInplace_Inval
 heapam.c:heap_inplace_update
  START_CRIT_SECTION
  BUFFER_LOCK_UNLOCK
  AtInplace_Inval
  END_CRIT_SECTION
  UnlockTuple
 AcceptInvalidationMessages

That's how I ended up at inplace120 v9's design.

> Also I have a small question from inplace120.
> 
> I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> we can just call
> IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> big advantage of storing a separate bool field? Also there is another

No, not a big advantage.  I felt it was more in line with the typical style of
src/backend/executor.

> write to ri_RelationDesc in CatalogOpenIndexes in
> src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to
> be set there also to keep it consistent with ri_RelationDesc. Please
> let me know if I am missing something about the usage of the new
> field.

Can you say more about consequences you found?

Only the full executor reads the field, doing so when it fetches the most
recent version of a row.  CatalogOpenIndexes() callers lack the full
executor's practice of fetching the most recent version of a row, so they
couldn't benefit from reading the field.

I don't think any CatalogOpenIndexes() caller passes its ResultRelInfo to the
full executor, and "typedef struct ResultRelInfo *CatalogIndexState" exists in
part to keep it that way.  Since CatalogOpenIndexes() skips ri_TrigDesc and
other fields, I would expect other malfunctions if some caller tried.

Thanks,
nm



Re: race condition in pg_class

From
Nitin Motiani
Date:
On Sat, Aug 31, 2024 at 6:40 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> > On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah@leadboat.com> wrote:
> > > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > > My order of preference is: 2, 1, 3.
> > >
> > > I kept tuple locking responsibility in heapam.c.  That's simpler and better
> > > for modularity, but it does mean we release+acquire after any xmax wait.
> > > Before, we avoided that if the next genam.c scan found the same TID.  (If the
> > > next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts
> > > are rare enough to justify simplifying as this version does.  I don't expect
> > > anyone to notice the starvation outside of tests built to show it.  (With
> > > previous versions, one can show it with a purpose-built test that commits
> > > instead of aborting, like the "001_pgbench_grant@9" test.)
> > >
> > > This move also loses the optimization of unpinning before XactLockTableWait().
> > > heap_update() doesn't optimize that way, so that's fine.
> > >
> > > The move ended up more like (1), though I did do
> > > s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I
> > > felt that worked better than (2) to achieve lock release before
> > > CacheInvalidateHeapTuple().  Alternatives that could be fine:
> > >
> > From a consistency point of view, I find it cleaner if we can have all
> > the heap_inplace_lock and heap_inplace_unlock in the same set of
> > functions. So here those would be the systable_inplace_* functions.
>
> That will technically be the case after inplace160, and I could make it so
> here by inlining heap_inplace_unlock() into its heapam.c caller.  Would that
> be cleaner or less clean?
>

I am not sure. It seems more inconsistent to take the lock using
heap_inplace_lock but then just unlock by calling LockBuffer. On the
other hand, it doesn't seem that different from the way
SearchSysCacheLocked1 and UnlockTuple are used in inplace120. If we
are doing it this way, perhaps it would be good to rename
heap_inplace_update to heap_inplace_update_and_unlock.

> > > - In the cancel case, call both systable_inplace_update_cancel and
> > >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> > >   _end would own systable_endscan().
> > >
> > What happens to CacheInvalidateHeapTuple() in this approach?  I think
> > it will still need to be brought to the genam.c layer if we are
> > releasing the lock in systable_inplace_update_finish.
>
> Cancel scenarios don't do invalidation.  (Same under other alternatives.)
>

Sorry, I wasn't clear about this one. Let me rephrase. My
understanding is that the code in this approach would look like below:
if (dirty)
   systable_inplace_update_finish(inplace_state, tup);
else
   systable_inplace_update_cancel(inplace_state);
systable_inplace_update_end(inplace_state);

And that in this structure, both _finish and _cancel will call
heap_inplace_unlock and then _end will call systable_endscan. So even
with this structure, the invalidation has to happen inside _finish
after the unlock. So this also pulls the invalidation to the genam.c
layer. Am I understanding this correctly?

> > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > >   tolerable now, this gets less attractive after the inplace160 patch from
> > >   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com
> > >
> > I skimmed through the inplace160 patch. It wasn't clear to me why this
> > becomes less attractive with the patch. I see there is a new
> > CacheInvalidateHeapTupleInPlace but that looks like it would be called
> > while holding the lock. And then there is an
> > AcceptInvalidationMessages which can perhaps be moved to the genam.c
> > layer too. Is the concern that one invalidation call will be in the
> > heapam layer and the other will be in the genam layer?
>
> That, or a critical section would start in heapam.c, then end in genam.c.
> Current call tree at inplace160 v4:
>
> genam.c:systable_inplace_update_finish
>  heapam.c:heap_inplace_update
>   PreInplace_Inval
>   START_CRIT_SECTION
>   BUFFER_LOCK_UNLOCK
>   AtInplace_Inval
>   END_CRIT_SECTION
>   UnlockTuple
>   AcceptInvalidationMessages
>
> If we hoisted all of invalidation up to the genam.c layer, a critical section
> that starts in heapam.c would end in genam.c:
>
> genam.c:systable_inplace_update_finish
>  PreInplace_Inval
>  heapam.c:heap_inplace_update
>   START_CRIT_SECTION
>  BUFFER_LOCK_UNLOCK
>  AtInplace_Inval
>  END_CRIT_SECTION
>  UnlockTuple
>  AcceptInvalidationMessages
>
> If we didn't accept splitting the critical section but did accept splitting
> invalidation responsibilities, one gets perhaps:
>
> genam.c:systable_inplace_update_finish
>  PreInplace_Inval
>  heapam.c:heap_inplace_update
>   START_CRIT_SECTION
>   BUFFER_LOCK_UNLOCK
>   AtInplace_Inval
>   END_CRIT_SECTION
>   UnlockTuple
>  AcceptInvalidationMessages
>

How about this alternative?

 genam.c:systable_inplace_update_finish
   PreInplace_Inval
   START_CRIT_SECTION
   heapam.c:heap_inplace_update
   BUFFER_LOCK_UNLOCK
   AtInplace_Inval
   END_CRIT_SECTION
   UnlockTuple
   AcceptInvalidationMessages

Looking at inplace160, it seems that the start of the critical section
is right after PreInplace_Inval. So why not pull START_CRIT_SECTION
and END_CRIT_SECTION out to the genam.c layer? Alternatively since
heap_inplace_update is commented as a subroutine of
systable_inplace_update_finish, should everything just be moved to the
genam.c layer? Although it looks like you already considered and
rejected this approach. So just pulling out the critical section's
start and end is fine. Am I missing something here?

If the above alternatives are not possible, it's probably fine to go
ahead with the current patch with the function renamed to
heap_inplace_update_and_unlock (or something similar) as mentioned
earlier?

> That's how I ended up at inplace120 v9's design.
>
> > Also I have a small question from inplace120.
> >
> > I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> > we can just call
> > IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> > big advantage of storing a separate bool field? Also there is another
>
> No, not a big advantage.  I felt it was more in line with the typical style of
> src/backend/executor.
>

Thanks for the clarification. For ri_TrigDesc, I see the following
comment in execMain.c :

/* make a copy so as not to depend on relcache info not changing... */
resultRelInfo->ri_TrigDesc = CopyTriggerDesc(resultRelationDesc->trigdesc);

So in this case I see more value in having a separate field compared
to the bool field for ri_needLockTagTuple.

> > write to ri_RelationDesc in CatalogOpenIndexes in
> > src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to
> > be set there also to keep it consistent with ri_RelationDesc. Please
> > let me know if I am missing something about the usage of the new
> > field.
>
> Can you say more about consequences you found?
>

My apologies that I wasn't clear. I haven't found any consequences. I
just find it a smell that there are two fields which are not
independent and they go out of sync. And that's why my preference is
to not have a dependent field unless there is a specific advantage.

> Only the full executor reads the field, doing so when it fetches the most
> recent version of a row.  CatalogOpenIndexes() callers lack the full
> executor's practice of fetching the most recent version of a row, so they
> couldn't benefit from reading the field.
>
> I don't think any CatalogOpenIndexes() caller passes its ResultRelInfo to the
> full executor, and "typedef struct ResultRelInfo *CatalogIndexState" exists in
> part to keep it that way.  Since CatalogOpenIndexes() skips ri_TrigDesc and
> other fields, I would expect other malfunctions if some caller tried.
>

Sorry, I missed the typedef. Thanks for pointing that out. I agree
that the likelihood of any malfunction is low. But even for the
ri_TrigDesc, CatalogOpenIndexes() still sets it to NULL. So shouldn't
ri_needLockTagTuple also be set to a default value of false? My
preference would be not to have a separate bool field to avoid
thinking about these scenarios.

Thanks & Regards
Nitin Motiani
Google



Re: race condition in pg_class

From
Noah Misch
Date:
On Tue, Sep 03, 2024 at 09:24:52PM +0530, Nitin Motiani wrote:
> On Sat, Aug 31, 2024 at 6:40 AM Noah Misch <noah@leadboat.com> wrote:
> > On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> > > On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah@leadboat.com> wrote:
> > > > - In the cancel case, call both systable_inplace_update_cancel and
> > > >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> > > >   _end would own systable_endscan().
> > > >
> > > What happens to CacheInvalidateHeapTuple() in this approach?  I think
> > > it will still need to be brought to the genam.c layer if we are
> > > releasing the lock in systable_inplace_update_finish.

> understanding is that the code in this approach would look like below:
> if (dirty)
>    systable_inplace_update_finish(inplace_state, tup);
> else
>    systable_inplace_update_cancel(inplace_state);
> systable_inplace_update_end(inplace_state);
> 
> And that in this structure, both _finish and _cancel will call
> heap_inplace_unlock and then _end will call systable_endscan. So even
> with this structure, the invalidation has to happen inside _finish
> after the unlock.

Right.

> So this also pulls the invalidation to the genam.c
> layer. Am I understanding this correctly?

Compared to the v9 patch, the "call both" alternative would just move the
systable_endscan() call to a new systable_inplace_update_end().  It wouldn't
move anything across the genam:heapam boundary.
systable_inplace_update_finish() would remain a thin wrapper around a heapam
function.

> > > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > > >   tolerable now, this gets less attractive after the inplace160 patch from
> > > >   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com
> > > >
> > > I skimmed through the inplace160 patch. It wasn't clear to me why this
> > > becomes less attractive with the patch. I see there is a new
> > > CacheInvalidateHeapTupleInPlace but that looks like it would be called
> > > while holding the lock. And then there is an
> > > AcceptInvalidationMessages which can perhaps be moved to the genam.c
> > > layer too. Is the concern that one invalidation call will be in the
> > > heapam layer and the other will be in the genam layer?
> >
> > That, or a critical section would start in heapam.c, then end in genam.c.
> > Current call tree at inplace160 v4:

> How about this alternative?
> 
>  genam.c:systable_inplace_update_finish
>    PreInplace_Inval
>    START_CRIT_SECTION
>    heapam.c:heap_inplace_update
>    BUFFER_LOCK_UNLOCK
>    AtInplace_Inval
>    END_CRIT_SECTION
>    UnlockTuple
>    AcceptInvalidationMessages

> Looking at inplace160, it seems that the start of the critical section
> is right after PreInplace_Inval. So why not pull START_CRIT_SECTION
> and END_CRIT_SECTION out to the genam.c layer?

heap_inplace_update() has an elog(ERROR) that needs to happen outside any
critical section.  Since the condition for that elog deals with tuple header
internals, it belongs at the heapam layer more than the systable layer.
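
For reference, the condition in question is the tuple-length check at the 
top of that function; paraphrasing the heapam.c code from memory (lp, htup 
and tuple are the page item, the on-page tuple header, and the caller's 
replacement tuple):

    uint32      oldlen;
    uint32      newlen;

    /* must run before START_CRIT_SECTION() is entered */
    oldlen = ItemIdGetLength(lp) - htup->t_hoff;
    newlen = tuple->t_len - tuple->t_data->t_hoff;
    if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
        elog(ERROR, "wrong tuple length");

An elog(ERROR) raised inside a critical section escalates to PANIC, so that 
check has to stay ahead of START_CRIT_SECTION().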

> Alternatively since
> heap_inplace_update is commented as a subroutine of
> systable_inplace_update_finish, should everything just be moved to the
> genam.c layer? Although it looks like you already considered and
> rejected this approach.

Calling XLogInsert(RM_HEAP_ID) in genam.c would be a worse modularity
violation than the one that led to the changes between v8 and v9.  I think
even calling CacheInvalidateHeapTuple() in genam.c would be a worse modularity
violation than the one attributed to v8.  Modularity would have the
heap_inplace function resemble heap_update() handling of invals.

> If the above alternatives are not possible, it's probably fine to go
> ahead with the current patch with the function renamed to
> heap_inplace_update_and_unlock (or something similar) as mentioned
> earlier?

I like that name.  The next version will use it.

> > > I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> > > we can just call
> > > IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> > > big advantage of storing a separate bool field? Also there is another
> >
> > No, not a big advantage.  I felt it was more in line with the typical style of
> > src/backend/executor.

> just find it a smell that there are two fields which are not
> independent and they go out of sync. And that's why my preference is
> to not have a dependent field unless there is a specific advantage.

Got it.  This check happens for every tuple of every UPDATE, so performance
may be a factor.  Some designs and their merits:

==== a. ri_needLockTagTuple
Performance: best: check one value for nonzero
Drawback: one more value lifecycle to understand
Drawback: users of ResultRelInfo w/o InitResultRelInfo() could miss this

==== b. call IsInplaceUpdateRelation
Performance: worst: two extern function calls, then compare against two values

==== c. make IsInplaceUpdateRelation() and IsInplaceUpdateOid() inline, and call
Performance: high: compare against two values
Drawback: unlike catalog.c peers
Drawback: extensions that call these must recompile if these change

==== d. add IsInplaceUpdateRelationInline() and IsInplaceUpdateOidInline(), and call
Performance: high: compare against two values
Drawback: more symbols to understand
Drawback: extensions might call these, reaching the drawback of (c)

I think my preference order is (a), (c), (d), (b).  How do you see it?
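
For concreteness, option (c) would amount to something like the sketch 
below (assumed shape only; the point is just that callers compare against 
two constants inline instead of paying two extern calls):

static inline bool
IsInplaceUpdateOid(Oid relid)
{
    /* per this thread, only pg_class and pg_database get inplace updates */
    return (relid == RelationRelationId ||
            relid == DatabaseRelationId);
}

static inline bool
IsInplaceUpdateRelation(Relation relation)
{
    return IsInplaceUpdateOid(RelationGetRelid(relation));
}

Option (d) would keep the existing extern symbols and add these under the 
*Inline names instead.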

> But even for the
> ri_TrigDesc, CatalogOpenIndexes() still sets it to NULL. So shouldn't
> ri_needLockTagTuple also be set to a default value of false?

CatalogOpenIndexes() explicitly zero-initializes two fields and relies on
makeNode() zeroing for dozens of others.  Hence, omitting the initialization
fits the function's local convention better than including it.  (PostgreSQL
has no policy or dominant practice about redundant zero-initialization.)



Re: race condition in pg_class

From
Nitin Motiani
Date:
On Wed, Sep 4, 2024 at 2:53 AM Noah Misch <noah@leadboat.com> wrote:
>
>
> > So this also pulls the invalidation to the genam.c
> > layer. Am I understanding this correctly?
>
> Compared to the v9 patch, the "call both" alternative would just move the
> systable_endscan() call to a new systable_inplace_update_end().  It wouldn't
> move anything across the genam:heapam boundary.
> systable_inplace_update_finish() would remain a thin wrapper around a heapam
> function.
>

Thanks for the clarification.

> > > > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > > > >   tolerable now, this gets less attractive after the inplace160 patch from
> > > > >   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com
> > > > >
> > > > I skimmed through the inplace160 patch. It wasn't clear to me why this
> > > > becomes less attractive with the patch. I see there is a new
> > > > CacheInvalidateHeapTupleInPlace but that looks like it would be called
> > > > while holding the lock. And then there is an
> > > > AcceptInvalidationMessages which can perhaps be moved to the genam.c
> > > > layer too. Is the concern that one invalidation call will be in the
> > > > heapam layer and the other will be in the genam layer?
> > >
> > > That, or a critical section would start in heapam.c, then end in genam.c.
> > > Current call tree at inplace160 v4:
>
> > How about this alternative?
> >
> >  genam.c:systable_inplace_update_finish
> >    PreInplace_Inval
> >    START_CRIT_SECTION
> >    heapam.c:heap_inplace_update
> >    BUFFER_LOCK_UNLOCK
> >    AtInplace_Inval
> >    END_CRIT_SECTION
> >    UnlockTuple
> >    AcceptInvalidationMessages
>
> > Looking at inplace160, it seems that the start of the critical section
> > is right after PreInplace_Inval. So why not pull START_CRIT_SECTION
> > and END_CRIT_SECTION out to the genam.c layer?
>
> heap_inplace_update() has an elog(ERROR) that needs to happen outside any
> critical section.  Since the condition for that elog deals with tuple header
> internals, it belongs at the heapam layer more than the systable layer.
>

Understood. How about this alternative then? The tuple length check
and the elog(ERROR) gets its own function. Something like
heap_inplace_update_validate or
heap_inplace_update_validate_tuple_length. So in that case, it would
look like this:

  genam.c:systable_inplace_update_finish
    heapam.c:heap_inplace_update_validate/heap_inplace_update_precheck
    PreInplace_Inval
    START_CRIT_SECTION
    heapam.c:heap_inplace_update
    BUFFER_LOCK_UNLOCK
    AtInplace_Inval
    END_CRIT_SECTION
    UnlockTuple
    AcceptInvalidationMessages

This is starting to get complicated though so I don't have any issues
with just renaming the heap_inplace_update to
heap_inplace_update_and_unlock.

> > Alternatively since
> > heap_inplace_update is commented as a subroutine of
> > systable_inplace_update_finish, should everything just be moved to the
> > genam.c layer? Although it looks like you already considered and
> > rejected this approach.
>
> Calling XLogInsert(RM_HEAP_ID) in genam.c would be a worse modularity
> violation than the one that led to the changes between v8 and v9.  I think
> even calling CacheInvalidateHeapTuple() in genam.c would be a worse modularity
> violation than the one attributed to v8.  Modularity would have the
> heap_inplace function resemble heap_update() handling of invals.
>

Understood. Thanks.

> > If the above alternatives are not possible, it's probably fine to go
> > ahead with the current patch with the function renamed to
> > heap_inplace_update_and_unlock (or something similar) as mentioned
> > earlier?
>
> I like that name.  The next version will use it.
>

So either we go with this or try the above approach of having a
separate function _validate/_precheck/_validate_tuple_length. I don't
have a strong opinion on either of these approaches.

> > > > I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> > > > we can just call
> > > > IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> > > > big advantage of storing a separate bool field? Also there is another
> > >
> > > No, not a big advantage.  I felt it was more in line with the typical style of
> > > src/backend/executor.
>
> > just find it a smell that there are two fields which are not
> > independent and they go out of sync. And that's why my preference is
> > to not have a dependent field unless there is a specific advantage.
>
> Got it.  This check happens for every tuple of every UPDATE, so performance
> may be a factor.  Some designs and their merits:
>

Thanks. If performance is a factor, it makes sense to keep it.

> ==== a. ri_needLockTagTuple
> Performance: best: check one value for nonzero
> Drawback: one more value lifecycle to understand
> Drawback: users of ResultRelInfo w/o InitResultRelInfo() could miss this
>
> ==== b. call IsInplaceUpdateRelation
> Performance: worst: two extern function calls, then compare against two values
>
> ==== c. make IsInplaceUpdateRelation() and IsInplaceUpdateOid() inline, and call
> Performance: high: compare against two values
> Drawback: unlike catalog.c peers
> Drawback: extensions that call these must recompile if these change
>
> ==== d. add IsInplaceUpdateRelationInline() and IsInplaceUpdateOidInline(), and call
> Performance: high: compare against two values
> Drawback: more symbols to understand
> Drawback: extensions might call these, reaching the drawback of (c)
>
> I think my preference order is (a), (c), (d), (b).  How do you see it?
>

My preference order would be the same. In general I like (c) more than
(a) but recompiling extensions sounds like a major drawback so here
the preference is (a).

Can we do (a) along with some extra checks? To elaborate, execMain.c
has a function called CheckValidResultRel which is called by
ExecInitModifyTable in nodeModifyTable.c. Can we add the following
assert (or just a debug assert) in this function?

 Assert(rel->ri_needLockTagTuple == IsInplaceUpdateRelation(rel->ri_RelationDesc));

This can safeguard against users of ResultRelInfo missing this field.
An alternative might be to only do debug assertions in the functions
which use the field. But it seems simpler to just do it once in the
ExecInitModifyTable.

> > But even for the
> > ri_TrigDesc, CatalogOpenIndexes() still sets it to NULL. So shouldn't
> > ri_needLockTagTuple also be set to a default value of false?
>
> CatalogOpenIndexes() explicitly zero-initializes two fields and relies on
> makeNode() zeroing for dozens of others.  Hence, omitting the initialization
> fits the function's local convention better than including it.  (PostgreSQL
> has no policy or dominant practice about redundant zero-initialization.)

Thanks. Makes sense.

Thanks & Regards
Nitin Motiani
Google



Re: race condition in pg_class

From
Nitin Motiani
Date:
On Thu, Sep 5, 2024 at 1:27 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > How about this alternative then? The tuple length check
> > and the elog(ERROR) gets its own function. Something like
> > heap_inplace_update_validate or
> > heap_inplace_update_validate_tuple_length. So in that case, it would
> > look like this:
> >
> >   genam.c:systable_inplace_update_finish
> >     heapam.c:heap_inplace_update_validate/heap_inplace_update_precheck
> >     PreInplace_Inval
> >     START_CRIT_SECTION
> >     heapam.c:heap_inplace_update
> >     BUFFER_LOCK_UNLOCK
> >     AtInplace_Inval
> >     END_CRIT_SECTION
> >     UnlockTuple
> >     AcceptInvalidationMessages
> >
> > This is starting to get complicated though so I don't have any issues
> > with just renaming the heap_inplace_update to
> > heap_inplace_update_and_unlock.
>
> Complexity aside, I don't see the _precheck design qualifying as a modularity
> improvement.
>
> > >  Assert(rel->ri_needLockTagTuple == IsInplaceUpdateRelation(rel->ri_RelationDesc));
> >
> > This can safeguard against users of ResultRelInfo missing this field.
>
> v10 does the rename and adds that assertion.  This question remains open:
>

Looks good. A couple of minor comments :
1. In the inplace110 commit message, there are still references to
heap_inplace_update. Should it be clarified that the function has been
renamed?
2. Should there be a comment above the ri_needLockTagTuple definition in
execnodes.h that we are caching this value to avoid function calls to
IsInplaceUpdateRelation for every tuple? Similar to how the comment
above ri_TrigFunctions mentions that it is cached lookup info.

> On Thu, Aug 22, 2024 at 12:32:00AM -0700, Noah Misch wrote:
> > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > How many of those for RELKIND_INDEX vs tables? I'm thinking if we should
> > > always require a tuple lock on indexes, if that would make a difference.
> >
> > Three sites.  See attached inplace125 patch.  Is it a net improvement?  If so,
> > I'll squash it into inplace120.
>
> If nobody has an opinion, I'll discard inplace125.  I feel it's not a net
> improvement, but either way is fine with me.

Seems moderately simpler to me. But there is still special handling
for the RELKIND_INDEX. Just that instead of doing it in
systable_inplace_update_begin, we have a special case in heap_update.
So overall it's only a small improvement and I'm fine either way.

Thanks & Regards
Nitin Motiani
Google



Re: race condition in pg_class

From
Noah Misch
Date:
On Thu, Sep 05, 2024 at 07:10:04PM +0530, Nitin Motiani wrote:
> On Thu, Sep 5, 2024 at 1:27 AM Noah Misch <noah@leadboat.com> wrote:
> > On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > >  Assert(rel->ri_needLockTagTuple == IsInplaceUpdateRelation(rel->ri_RelationDesc));
> > >
> > > This can safeguard against users of ResultRelInfo missing this field.
> >
> > v10 does the rename and adds that assertion.  This question remains open:
> 
> Looks good. A couple of minor comments :
> 1. In the inplace110 commit message, there are still references to
> heap_inplace_update. Should it be clarified that the function has been
> renamed?

PGXN has only one caller of this function, so I think that wouldn't help
readers enough.  If someone gets a compiler error about the old name, they'll
figure it out without commit log guidance.  If a person doesn't get a compiler
error, they didn't need to read about the fact of the rename.

> 2. Should there be a comment above the ri_needLockTagTuple definition in
> execnodes.h that we are caching this value to avoid function calls to
> IsInplaceUpdateRelation for every tuple? Similar to how the comment
> above ri_TrigFunctions mentions that it is cached lookup info.

Current comment:

    /* updates do LockTuple() before oldtup read; see README.tuplock */
    bool        ri_needLockTagTuple;

Once the comment doesn't fit in one line, pgindent rules make it take a
minimum of four lines.  I don't think words about avoiding function calls
would add enough value to justify the vertical space, because a person
starting to remove it would see where it's called.  That's not to say the
addition would be negligent.  If someone else were writing the patch and had
included that, I wouldn't be deleting the material.



Re: race condition in pg_class

From
Nitin Motiani
Date:
On Fri, Sep 6, 2024 at 3:34 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Thu, Sep 05, 2024 at 07:10:04PM +0530, Nitin Motiani wrote:
> > On Thu, Sep 5, 2024 at 1:27 AM Noah Misch <noah@leadboat.com> wrote:
> > > On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > > >  Assert(rel->ri_needLockTagTuple == IsInplaceUpdateRelation(rel->ri_RelationDesc));
> > > >
> > > > This can safeguard against users of ResultRelInfo missing this field.
> > >
> > > v10 does the rename and adds that assertion.  This question remains open:
> >
> > Looks good. A couple of minor comments :
> > 1. In the inplace110 commit message, there are still references to
> > heap_inplace_update. Should it be clarified that the function has been
> > renamed?
>
> PGXN has only one caller of this function, so I think that wouldn't help
> readers enough.  If someone gets a compiler error about the old name, they'll
> figure it out without commit log guidance.  If a person doesn't get a compiler
> error, they didn't need to read about the fact of the rename.
>
> > 2. Should there be a comment above the ri_needLockTagTuple definition in
> > execnodes.h that we are caching this value to avoid function calls to
> > IsInplaceUpdateRelation for every tuple? Similar to how the comment
> > above ri_TrigFunctions mentions that it is cached lookup info.
>
> Current comment:
>
>         /* updates do LockTuple() before oldtup read; see README.tuplock */
>         bool            ri_needLockTagTuple;
>
> Once the comment doesn't fit in one line, pgindent rules make it take a
> minimum of four lines.  I don't think words about avoiding function calls
> would add enough value to justify the vertical space, because a person
> starting to remove it would see where it's called.  That's not to say the
> addition would be negligent.  If someone else were writing the patch and had
> included that, I wouldn't be deleting the material.

Thanks. I have no other comments.



Re: race condition in pg_class

From
Noah Misch
Date:
On Fri, Sep 06, 2024 at 03:22:48PM +0530, Nitin Motiani wrote:
> Thanks. I have no other comments.

https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".
When someone moves it to status="Ready for Committer", I will commit
inplace090, inplace110, and inplace120 patches.  If one of you is comfortable
with that, please modify the status.



Re: race condition in pg_class

From
Nitin Motiani
Date:
On Sat, Sep 7, 2024 at 12:25 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Fri, Sep 06, 2024 at 03:22:48PM +0530, Nitin Motiani wrote:
> > Thanks. I have no other comments.
>
> https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".
> When someone moves it to status="Ready for Committer", I will commit
> inplace090, inplace110, and inplace120 patches.  If one of you is comfortable
> with that, please modify the status.

Done.



Re: race condition in pg_class

From
Noah Misch
Date:
On Thu, Sep 19, 2024 at 02:33:46PM -0700, Noah Misch wrote:
> On Mon, Sep 09, 2024 at 10:55:32AM +0530, Nitin Motiani wrote:
> > On Sat, Sep 7, 2024 at 12:25 AM Noah Misch <noah@leadboat.com> wrote:
> > > https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".
> > > When someone moves it to status="Ready for Committer", I will commit
> > > inplace090, inplace110, and inplace120 patches.  If one of you is comfortable
> > > with that, please modify the status.
> > 
> > Done.
> 
> FYI, here are the branch-specific patches.  I plan to push these after the v17
> release freeze lifts next week.

Pushed, but the pushes contained at least one defect:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=akepa&dt=2024-09-24%2022%3A29%3A02

I will act on that and other buildfarm failures that show up.



Re: race condition in pg_class

From
Noah Misch
Date:
On Thu, Aug 22, 2024 at 12:32:00AM -0700, Noah Misch wrote:
> This move also loses the optimization of unpinning before XactLockTableWait().
> heap_update() doesn't optimize that way, so that's fine.

In this other thread, I'm proposing to go back to unpinning:
https://postgr.es/m/20241027214035.8a.nmisch@google.com



Re: race condition in pg_class

From
Andres Freund
Date:
Hi,

On 2024-05-12 16:29:23 -0700, Noah Misch wrote:
> Author:     Noah Misch <noah@leadboat.com>
> Commit:     Noah Misch <noah@leadboat.com>
> 
>     Make TAP todo_start effects the same under Meson and prove_check.
>     
>     This could have caused spurious failures only on SPARC Linux, because
>     today's only todo_start tests for that platform.  Back-patch to v16,
>     where Meson support first appeared.
>     
>     Reviewed by FIXME.
>     
>     Discussion: https://postgr.es/m/FIXME
> 
> diff --git a/src/tools/testwrap b/src/tools/testwrap
> index d01e610..9a270be 100755
> --- a/src/tools/testwrap
> +++ b/src/tools/testwrap
> @@ -41,12 +41,22 @@ env_dict = {**os.environ,
>              'TESTDATADIR': os.path.join(testdir, 'data'),
>              'TESTLOGDIR': os.path.join(testdir, 'log')}
>  
> -sp = subprocess.run(args.test_command, env=env_dict)
> +sp = subprocess.Popen(args.test_command, env=env_dict, stdout=subprocess.PIPE)
> +# Meson categorizes a passing TODO test point as bad
> +# (https://github.com/mesonbuild/meson/issues/13183).  Remove the TODO
> +# directive, so Meson computes the file result like Perl does.  This could
> +# have the side effect of delaying stdout lines relative to stderr.  That
> +# doesn't affect the log file, and the TAP protocol uses stdout only.
> +for line in sp.stdout:
> +    if line.startswith(b'ok '):
> +        line = line.replace(b' # TODO ', b' # testwrap-overridden-TODO ', 1)
> +    sys.stdout.buffer.write(line)
> +returncode = sp.wait()

This has the issue that it causes the testwrap output to be buffered, which
makes running tests with `meson test -v <testname>` update the output less
promptly, only updating whenever the output buffer is flushed.

That's not the end of the world, but it'd be nice to get the output more
promptly again. It doesn't matter that much when running the tests normally,
but if you run them with valgrind or such and you just want to see the first
failure, because it's going to take an hour to finish all tests...

The easiest fix is to just explicitly flush after each line, as in the
attached.

Greetings,

Andres Freund

Attachment

Re: race condition in pg_class

From
Noah Misch
Date:
On Tue, Mar 18, 2025 at 03:03:52PM -0400, Andres Freund wrote:
> Subject: [PATCH v1] meson: Flush stdout in testwrap
> 
> Otherwise the progress won't reliably be displayed during a test.
> ---
>  src/tools/testwrap | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/src/tools/testwrap b/src/tools/testwrap
> index 8ae8fb79ba7..21105146c9d 100755
> --- a/src/tools/testwrap
> +++ b/src/tools/testwrap
> @@ -61,6 +61,7 @@ for line in sp.stdout:
>      if line.startswith(b'ok '):
>          line = line.replace(b' # TODO ', b' # testwrap-overridden-TODO ', 1)
>      sys.stdout.buffer.write(line)
> +    sys.stdout.flush()
>  returncode = sp.wait()

Fine with me.



Re: race condition in pg_class

From
Andres Freund
Date:
Hi,

On 2025-03-18 12:17:41 -0700, Noah Misch wrote:
> On Tue, Mar 18, 2025 at 03:03:52PM -0400, Andres Freund wrote:
> > Subject: [PATCH v1] meson: Flush stdout in testwrap
> > 
> > Otherwise the progress won't reliably be displayed during a test.
> > ---
> >  src/tools/testwrap | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/src/tools/testwrap b/src/tools/testwrap
> > index 8ae8fb79ba7..21105146c9d 100755
> > --- a/src/tools/testwrap
> > +++ b/src/tools/testwrap
> > @@ -61,6 +61,7 @@ for line in sp.stdout:
> >      if line.startswith(b'ok '):
> >          line = line.replace(b' # TODO ', b' # testwrap-overridden-TODO ', 1)
> >      sys.stdout.buffer.write(line)
> > +    sys.stdout.flush()
> >  returncode = sp.wait()
> 
> Fine with me.

Cool. Pushed.

Thanks,

Andres