Thread: race condition in pg_class
Hello, hackers!
We are running PG13.10 and recently we have encountered what appears to be a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.
The problem is that after successfully creating an index on a relation (which previously didn't have any indexes), its pg_class.relhasindex remains set to "false", which is illegal, I think.
The index was built using the following statement:
ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
PG_CLASS:
# select ctid,oid,xmin,xmax, * from pg_class where oid = 3566558198;
-[ RECORD 1 ]-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ctid | (12,49)
oid | 3566558198
xmin | 1202298791
xmax | 0
relname | example
relnamespace | 16479
reltype | 3566558200
reloftype | 0
relowner | 16386
relam | 2
relfilenode | 3566558198
reltablespace | 0
relpages | 152251
reltuples | 1.1565544e+07
relallvisible | 127837
reltoastrelid | 3566558203
relhasindex | f
relisshared | f
relpersistence | p
relkind | r
relnatts | 12
relchecks | 0
relhasrules | f
relhastriggers | f
relhassubclass | f
relrowsecurity | f
relforcerowsecurity | f
relispopulated | t
relreplident | d
relispartition | f
relrewrite | 0
relfrozenxid | 1201647807
relminmxid | 1
relacl |
reloptions |
relpartbound |
PG_INDEX:
# select ctid,xmin,xmax,indexrelid::regclass,indrelid::regclass, * from pg_index where indexrelid = 3569625749;
-[ RECORD 1 ]--+---------------------------------------------
ctid | (3,30)
xmin | 1202295045
xmax | 0
indexrelid | "example_pkey"
indrelid | "example"
indexrelid | 3569625749
indrelid | 3566558198
indnatts | 1
indnkeyatts | 1
indisunique | t
indisprimary | t
indisexclusion | f
indimmediate | t
indisclustered | f
indisvalid | t
indcheckxmin | f
indisready | t
indislive | t
indisreplident | f
indkey | 1
indcollation | 0
indclass | 3124
indoption | 0
indexprs |
indpred |
Looking into the WAL via pg_waldump gave us the following picture (full pg_waldump output is attached):
tx: 1202295045, lsn: AAB1/D38378D0, prev AAB1/D3837208, desc: FPI , blkref #0: rel 1663/16387/3569625749 blk 0 FPW
tx: 1202298790, lsn: AAB1/D3912EC0, prev AAB1/D3912E80, desc: NEW_CID rel 1663/16387/1259; tid 6/24; cmin: 0, cmax: 4294967295, combo: 4294967295
tx: 1202298790, lsn: AAB1/D3927580, prev AAB1/D3926988, desc: COMMIT 2023-10-04 22:41:23.863979 UTC
tx: 1202298791, lsn: AAB1/D393C230, prev AAB1/D393C1F0, desc: HOT_UPDATE off 24 xmax 1202298791 flags 0x20 ; new off 45 xmax 0, blkref #0: rel 1663/16387/1259 blk 6
tx: 1202298791, lsn: AAB1/D394ADA0, prev AAB1/D394AD60, desc: UPDATE off 45 xmax 1202298791 flags 0x00 ; new off 28 xmax 0, blkref #0: rel 1663/16387/1259 blk 5, blkref #1: rel 1663/16387/1259 blk 6
tx: 1202298791, lsn: AAB1/D3961088, prev AAB1/D3961048, desc: NEW_CID rel 1663/16387/1259; tid 12/49; cmin: 50, cmax: 4294967295, combo: 4294967295
tx: 1202295045, lsn: AAB1/D3962E78, prev AAB1/D3962E28, desc: INPLACE off 24, blkref #0: rel 1663/16387/1259 blk 6
tx: 1202295045, lsn: AAB1/D39632A0, prev AAB1/D3963250, desc: COMMIT 2023-10-04 22:41:23.878565 UTC
tx: 1202298791, lsn: AAB1/D3973420, prev AAB1/D39733D0, desc: COMMIT 2023-10-04 22:41:23.884951 UTC
1202295045 - create index statement
1202298790 and 1202298791 are some other concurrent operations; unfortunately, I wasn't able to determine what they are.
So it looks like 1202295045 updated tuple (6,24) in pg_class in place, but at that point its xmax had already been set by 1202298791 and a new tuple had been created at (12,49).
So after 1202298791 committed, that inplace update was effectively lost.
If we do an inclusive PITR with recovery_target_xid = 1202295045, we can see the following picture (notice relhasindex and xmax):
# select ctid,oid, xmin,xmax,relhasindex,cmin,cmax from pg_class where oid = 3566558198;
-[ RECORD 1 ]-----------
ctid | (6,24)
oid | 3566558198
xmin | 1202298790
xmax | 1202298791
relhasindex | t
cmin | 0
cmax | 0
I've tried to reproduce this scenario with CREATE INDEX and various concurrent statements, but no luck.
Attached is the full pg_waldump output for the relevant WAL segment.
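In case other tables were hit by the same race, a rough way to look for the symptom (just a sketch, not something we ran as part of this report) is to cross-check pg_class against pg_index:

select c.oid::regclass as table_name, c.relhasindex, count(i.indexrelid) as index_count
from pg_class c
join pg_index i on i.indrelid = c.oid
where not c.relhasindex
group by c.oid, c.relhasindex;

Any row returned means pg_class claims "no indexes" while pg_index disagrees.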
> On 25 Oct 2023, at 13:39, Smolkin Grigory <smallkeen@gmail.com> wrote:
>
> We are running PG13.10 and recently we have encountered what appears to be a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.
> I've tried to reproduce this scenario with CREATE INDEX and various concurrent statements, but no luck.

Maybe it would be possible to reproduce by modifying the tests for concurrent index creation. For example, add "ANALYZE" here [0]. Keep in mind that for easier reproduction it would make sense to increase the transaction count radically.

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/blob/master/contrib/amcheck/t/002_cic.pl#L34
Smolkin Grigory <smallkeen@gmail.com> writes:
> We are running PG13.10 and recently we have encountered what appears to be a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which previously didn't have any indexes), its pg_class.relhasindex remains set to "false", which is illegal, I think.
> The index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

ALTER TABLE ADD CONSTRAINT would certainly have taken AccessExclusiveLock on the "example" table, which should be sufficient to prevent anything else from touching its pg_class row. The only mechanism I can think of that might bypass that is a manual UPDATE on pg_class, which would just manipulate the row as a row without concern for associated relation-level locks. Any chance that somebody was doing something like that?

			regards, tom lane
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row. The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks. Any chance that somebody was
> doing something like that?
No chance. Our infrastructure doesn't do that, and users simply don't have the privileges to mess with pg_catalog.
On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> We are running PG13.10 and recently we have encountered what appears to be a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating an index on a relation (which previously didn't have any indexes), its pg_class.relhasindex remains set to "false", which is illegal, I think.
> The index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

This is going to be a problem with any operation that does a transactional pg_class update without taking a lock that conflicts with ShareLock. GRANT doesn't lock the table at all, so I can reproduce this in v17 as follows:

== session 1
create table t (c int);
begin;
grant select on t to public;

== session 2
alter table t add primary key (c);

== back in session 1
commit;

We'll likely need to change how we maintain relhasindex or perhaps take a lock in GRANT.

> Looking into the WAL via pg_waldump gave us the following picture (full pg_waldump output is attached):
> 1202295045 - create index statement
> 1202298790 and 1202298791 are some other concurrent operations; unfortunately, I wasn't able to determine what they are.

Can you explore that as follows?

- PITR to just before the COMMIT record.
- Save all rows of pg_class.
- PITR to just after the COMMIT record.
- Save all rows of pg_class.
- Diff the two sets of saved rows.

Which columns changed? The evidence you've shown would be consistent with a transaction doing GRANT or REVOKE on dozens of tables. If the changed column is something other than relacl, that would be great to know.

On the off-chance it's relevant, what extensions do you have (\dx in psql)?
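For the "save all rows" steps, something like this in psql would do (a sketch only; the file names are placeholders):

-- on the instance recovered to just before the COMMIT record
\copy (select * from pg_class order by oid) to 'pg_class_before.csv' with (format csv, header)
-- on the instance recovered to just after the COMMIT record
\copy (select * from pg_class order by oid) to 'pg_class_after.csv' with (format csv, header)

Then diff the two files outside psql.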
> This is going to be a problem with any operation that does a transactional
> pg_class update without taking a lock that conflicts with ShareLock. GRANT
> doesn't lock the table at all, so I can reproduce this in v17 as follows:
>
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
>
> == session 2
> alter table t add primary key (c);
>
> == back in session 1
> commit;
>
>
> We'll likely need to change how we maintain relhasindex or perhaps take a lock
> in GRANT.
Oh, that explains it. Thank you very much.
> Can you explore that as follows?
>
> - PITR to just before the COMMIT record.
> - Save all rows of pg_class.
> - PITR to just after the COMMIT record.
> - Save all rows of pg_class.
> - Diff the two sets of saved rows.
Sure, but it will take some time; it's a large DB with lots of WAL segments to apply.
> extensions
extname | extversion
--------------------+------------
plpgsql | 1.0
pg_stat_statements | 1.8
pg_buffercache | 1.3
pgstattuple | 1.5
On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> > We are running PG13.10 and recently we have encountered what appears to be a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating an index on a relation (which previously didn't have any indexes), its pg_class.relhasindex remains set to "false", which is illegal, I think.

It's damaging. The table will behave like it has no indexes. If something adds an index later, old indexes will reappear, corrupt, having not received updates during the relhasindex=false era. ("pg_amcheck --heapallindexed" can detect this.)

> > The index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
>
> This is going to be a problem with any operation that does a transactional pg_class update without taking a lock that conflicts with ShareLock. GRANT doesn't lock the table at all, so I can reproduce this in v17 as follows:
>
> == session 1
> create table t (c int);
> begin;
> grant select on t to public;
>
> == session 2
> alter table t add primary key (c);
>
> == back in session 1
> commit;
>
> We'll likely need to change how we maintain relhasindex or perhaps take a lock in GRANT.

The main choice is accepting more DDL blocking vs. accepting inefficient relcache builds. Options I'm seeing:

=== "more DDL blocking" option family

B1. Take ShareUpdateExclusiveLock in GRANT, REVOKE, and anything that makes transactional pg_class updates without holding some stronger lock. New asserts could catch future commands failing to do this.

B2. Take some shorter-lived lock around pg_class tuple formation, such that GRANT blocks CREATE INDEX, but two CREATE INDEX don't block each other. Anything performing a transactional update of a pg_class row would acquire the lock in exclusive mode before fetching the old tuple and hold it till end of transaction. relhasindex=true in-place updates would acquire it the same way, but they would release it after the inplace update. I expect a new heavyweight lock type, e.g. LOCKTAG_RELATION_DEFINITION, with the same key as LOCKTAG_RELATION. This has less blocking than the previous option, but it's more complicated to explain to both users and developers.

B3. Have CREATE INDEX do an EvalPlanQual()-style thing to update all successor tuple versions. Like the previous option, this would require new locking, but the new lock would not need to persist till end of xact. It would be even more complicated to explain to users and developers. (If this is promising enough to warrant more detail, let me know.)

B4. Use transactional updates to set relhasindex=true. Two CREATE INDEX commands on the same table would block each other. If we did it the way most DDL does today, they'd get "tuple concurrently updated" failures after the blocking ends.

=== "inefficient relcache builds" option family

R1. Ignore relhasindex; possibly remove it in v17. Relcache builds et al. will issue more superfluous queries.

R2. As a weird variant of the previous option, keep relhasindex and make all transactional updates of pg_class set relhasindex=true pessimistically. (VACUUM will set it back to false.)

=== other

O1. This is another case where the sometimes-discussed "pg_class_nt" for nontransactional columns would help. I'm ruling that out as too hard to back-patch.

Are there other options important to consider? I currently like (B1) the most, followed closely by (R1) and (B2). A key unknown is the prevalence of index-free tables. Low prevalence would argue in favor of (R1). In my limited experience, they've been rare. That said, I assume relcache builds happen a lot more than GRANTs, so it's harder to bound the damage from (R1) compared to the damage from (B1). Thoughts on this decision?

Thanks,
nm
Noah Misch <noah@leadboat.com> writes:
> On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
>> We'll likely need to change how we maintain relhasindex or perhaps take a lock in GRANT.

> The main choice is accepting more DDL blocking vs. accepting inefficient relcache builds. Options I'm seeing:

It looks to me like you're only thinking about relhasindex, but it seems to me that any call of heap_inplace_update brings some risk of this kind. Excluding the bootstrap-mode-only usage in create_toast_table, I see four callers:

* index_update_stats updating a pg_class tuple's relhasindex, relpages, reltuples, relallvisible

* vac_update_relstats updating a pg_class tuple's relpages, reltuples, relallvisible, relhasindex, relhasrules, relhastriggers, relfrozenxid, relminmxid

* vac_update_datfrozenxid updating a pg_database tuple's datfrozenxid, datminmxid

* dropdb updating a pg_database tuple's datconnlimit

So we have just as much of a problem with GRANTs on databases as GRANTs on relations. Also, it looks like we can lose knowledge of the presence of rules and triggers, which seems nearly as bad as forgetting about indexes. The rest of these updates might not be correctness-critical, although I wonder how bollixed things could get if we forget an advancement of relfrozenxid or datfrozenxid (especially if the calling transaction goes on to make other changes that assume that the update happened).

BTW, vac_update_datfrozenxid believes (correctly I think) that it cannot use the syscache copy of a tuple as the basis for in-place update, because syscache will have detoasted any toastable fields. These other callers are ignoring that, which seems like it should result in heap_inplace_update failing with "wrong tuple length". I wonder how come we're not seeing reports of that from the field.

I'm inclined to propose that heap_inplace_update should check to make sure that it's operating on the latest version of the tuple (including, I guess, waiting for an uncommitted update?) and throw error if not. I think this is what your B3 option is, but maybe I misinterpreted. It might be better to throw error immediately instead of waiting to see if the other updater commits.

			regards, tom lane
On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Thu, Oct 26, 2023 at 09:44:04PM -0700, Noah Misch wrote:
> >> We'll likely need to change how we maintain relhasindex or perhaps take a lock in GRANT.
>
> > The main choice is accepting more DDL blocking vs. accepting inefficient relcache builds. Options I'm seeing:
>
> It looks to me like you're only thinking about relhasindex, but it seems to me that any call of heap_inplace_update brings some risk of this kind. Excluding the bootstrap-mode-only usage in create_toast_table, I see four callers:
>
> * index_update_stats updating a pg_class tuple's relhasindex, relpages, reltuples, relallvisible
>
> * vac_update_relstats updating a pg_class tuple's relpages, reltuples, relallvisible, relhasindex, relhasrules, relhastriggers, relfrozenxid, relminmxid
>
> * vac_update_datfrozenxid updating a pg_database tuple's datfrozenxid, datminmxid
>
> * dropdb updating a pg_database tuple's datconnlimit
>
> So we have just as much of a problem with GRANTs on databases as GRANTs on relations. Also, it looks like we can lose knowledge of the presence of rules and triggers, which seems nearly as bad as forgetting about indexes. The rest of these updates might not be correctness-critical, although I wonder how bollixed things could get if we forget an advancement of relfrozenxid or datfrozenxid (especially if the calling transaction goes on to make other changes that assume that the update happened).

Thanks for researching that. Let's treat frozenxid stuff as critical; I wouldn't want to advance XID limits based on a datfrozenxid that later gets rolled back. I agree relhasrules and relhastriggers are also critical. The "inefficient relcache builds" option family can't solve cases like relfrozenxid and datconnlimit, so that leaves us with the "more DDL blocking" option family.

> BTW, vac_update_datfrozenxid believes (correctly I think) that it cannot use the syscache copy of a tuple as the basis for in-place update, because syscache will have detoasted any toastable fields. These other callers are ignoring that, which seems like it should result in heap_inplace_update failing with "wrong tuple length". I wonder how come we're not seeing reports of that from the field.

Good question. Perhaps we'll need some test cases that exercise each inplace update against a row having a toast pointer. It's too easy to go a long time without encountering those in the field.

> I'm inclined to propose that heap_inplace_update should check to make sure that it's operating on the latest version of the tuple (including, I guess, waiting for an uncommitted update?) and throw error if not. I think this is what your B3 option is, but maybe I misinterpreted. It might be better to throw error immediately instead of waiting to see if the other updater commits.

That's perhaps closer to B2. To be pedantic, B3 was about not failing or waiting for GRANT to commit but instead inplace-updating every member of the update chain. For B2, I was thinking we don't need to error. There are two problematic orders of events. The easy one is heap_inplace_update() mutating a tuple that already has an xmax. That's the one in the test case upthread, and detecting it is trivial. The harder one is heap_inplace_update() mutating a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().

I anticipate a new locktag per catalog that can receive inplace updates, i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION. Here's a walk-through for the pg_database case. GRANT will use the following sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- fetch latest pg_database tuple
- heap_update()
- COMMIT, releasing LOCKTAG_DATABASE_DEFINITION

vac_update_datfrozenxid() sequence of events:

- acquire LOCKTAG_DATABASE_DEFINITION in exclusive mode
- (now, all GRANTs on the given database have committed or aborted)
- fetch latest pg_database tuple
- heap_inplace_update()
- release LOCKTAG_DATABASE_DEFINITION, even if xact not ending
- continue with other steps, e.g. vac_truncate_clog()

How does that compare to what you envisioned? vac_update_datfrozenxid() could further use xmax as a best-efforts thing to catch conflict with manual UPDATE statements, but it wouldn't solve the case where the UPDATE had fetched the tuple but not yet heap_update()'d it.
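For concreteness, the SQL-level shape of the pg_database conflict is roughly the following (untested sketch only; the database name is a placeholder, and vac_update_datfrozenxid() writes the row in place only when datfrozenxid/datminmxid actually move, so this shows the problematic ordering rather than being a reliable reproducer):

== session 1
begin;
grant connect on database postgres to public;  -- transactional pg_database update

== session 2
vacuum;  -- a plain VACUUM ends with vac_update_datfrozenxid(), which may inplace-update the same row

== back in session 1
commit;  -- the committed heap_update can then clobber the inplace change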
Noah Misch <noah@leadboat.com> writes:
> On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
>> I'm inclined to propose that heap_inplace_update should check to make sure that it's operating on the latest version of the tuple (including, I guess, waiting for an uncommitted update?) and throw error if not. I think this is what your B3 option is, but maybe I misinterpreted. It might be better to throw error immediately instead of waiting to see if the other updater commits.

> That's perhaps closer to B2. To be pedantic, B3 was about not failing or waiting for GRANT to commit but instead inplace-updating every member of the update chain. For B2, I was thinking we don't need to error. There are two problematic orders of events. The easy one is heap_inplace_update() mutating a tuple that already has an xmax. That's the one in the test case upthread, and detecting it is trivial. The harder one is heap_inplace_update() mutating a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().

Ugh ... you're right, what I was imagining would not catch that last case.

> I anticipate a new locktag per catalog that can receive inplace updates, i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.

We could perhaps make this work by using the existing tuple-lock infrastructure, rather than inventing new locktags (a choice that spills to a lot of places including clients that examine pg_locks).

I would prefer though to find a solution that only depends on making heap_inplace_update protect itself, without high-level cooperation from the possibly-interfering updater. This is basically because I'm still afraid that we're defining the problem too narrowly. For one thing, I have nearly zero confidence that GRANT et al are the only problematic source of conflicting transactional updates. For another, I'm worried that some extension may be using heap_inplace_update against a catalog we're not considering here. I'd also like to find a solution that fixes the case of a conflicting manual UPDATE (although certainly that's a stretch goal we may not be able to reach).

I wonder if there's a way for heap_inplace_update to mark the tuple header as just-updated in a way that regular heap_update could recognize. (For standard catalog updates, we'd then end up erroring in simple_heap_update, which I think is fine.) We can't update xmin, because the VACUUM callers don't have an XID; but maybe there's some other way? I'm speculating about putting a funny value into xmax, or something like that, and having heap_update check that what it sees in xmax matches what was in the tuple the update started with.

Or we could try to get rid of in-place updates, but that seems like a mighty big lift. All of the existing callers, except maybe the johnny-come-lately dropdb usage, have solid documented reasons to do it that way.

			regards, tom lane
On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> >> I'm inclined to propose that heap_inplace_update should check to make sure that it's operating on the latest version of the tuple (including, I guess, waiting for an uncommitted update?) and throw error if not. I think this is what your B3 option is, but maybe I misinterpreted. It might be better to throw error immediately instead of waiting to see if the other updater commits.
>
> > That's perhaps closer to B2. To be pedantic, B3 was about not failing or waiting for GRANT to commit but instead inplace-updating every member of the update chain. For B2, I was thinking we don't need to error. There are two problematic orders of events. The easy one is heap_inplace_update() mutating a tuple that already has an xmax. That's the one in the test case upthread, and detecting it is trivial. The harder one is heap_inplace_update() mutating a tuple after GRANT fetches the old tuple, before GRANT enters heap_update().
>
> Ugh ... you're right, what I was imagining would not catch that last case.
>
> > I anticipate a new locktag per catalog that can receive inplace updates, i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
>
> We could perhaps make this work by using the existing tuple-lock infrastructure, rather than inventing new locktags (a choice that spills to a lot of places including clients that examine pg_locks).

That could be okay. It would be weird to reuse a short-term lock like that one as something held till end of transaction. But the alternative of new locktags ain't perfect, as you say.

> I would prefer though to find a solution that only depends on making heap_inplace_update protect itself, without high-level cooperation from the possibly-interfering updater. This is basically because I'm still afraid that we're defining the problem too narrowly. For one thing, I have nearly zero confidence that GRANT et al are the only problematic source of conflicting transactional updates.

Likewise here, but I have fair confidence that an assertion would flush out the rest. heap_inplace_update() would assert that the backend holds one of the acceptable locks. It could even be an elog; heap_inplace_update() can tolerate that cost.

> For another, I'm worried that some extension may be using heap_inplace_update against a catalog we're not considering here.

A pgxn search finds "citus" using heap_inplace_update().

> I'd also like to find a solution that fixes the case of a conflicting manual UPDATE (although certainly that's a stretch goal we may not be able to reach).

It would be nice.

> I wonder if there's a way for heap_inplace_update to mark the tuple header as just-updated in a way that regular heap_update could recognize. (For standard catalog updates, we'd then end up erroring in simple_heap_update, which I think is fine.) We can't update xmin, because the VACUUM callers don't have an XID; but maybe there's some other way? I'm speculating about putting a funny value into xmax, or something like that, and having heap_update check that what it sees in xmax matches what was in the tuple the update started with.

Hmmm. Achieving it without an XID would be the real trick. (With an XID, we could use xl_heap_lock like heap_update() does.) Thinking out loud, what if heap_inplace_update() sets HEAP_XMAX_INVALID and xmax = TransactionIdAdvance(xmax)? Or change t_ctid in a similar way. Then regular heap_update() could complain if the field changed vs. last seen value. This feels like something to regret later in terms of limiting our ability to harness those fields for more-valuable ends or compact them away in a future page format. I can't pinpoint a specific loss, so the idea might have legs.

Nontransactional data in separate tables or in new metapages smells like the right long-term state. A project wanting to reuse the tuple header bits could introduce such storage to unblock its own bit reuse.

> Or we could try to get rid of in-place updates, but that seems like a mighty big lift. All of the existing callers, except maybe the johnny-come-lately dropdb usage, have solid documented reasons to do it that way.

Yes, removing that smells problematic.
I prototyped two ways, one with a special t_ctid and one with LockTuple().

On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:
> > > I anticipate a new locktag per catalog that can receive inplace updates, i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> >
> > We could perhaps make this work by using the existing tuple-lock infrastructure, rather than inventing new locktags (a choice that spills to a lot of places including clients that examine pg_locks).
>
> That could be okay. It would be weird to reuse a short-term lock like that one as something held till end of transaction. But the alternative of new locktags ain't perfect, as you say.

That worked.

> > I would prefer though to find a solution that only depends on making heap_inplace_update protect itself, without high-level cooperation from the possibly-interfering updater. This is basically because I'm still afraid that we're defining the problem too narrowly. For one thing, I have nearly zero confidence that GRANT et al are the only problematic source of conflicting transactional updates.
>
> Likewise here, but I have fair confidence that an assertion would flush out the rest. heap_inplace_update() would assert that the backend holds one of the acceptable locks. It could even be an elog; heap_inplace_update() can tolerate that cost.

That check would fall in both heap_inplace_update() and heap_update(). After all, a heap_inplace_update() check won't detect an omission in GRANT.

> > For another, I'm worried that some extension may be using heap_inplace_update against a catalog we're not considering here.
>
> A pgxn search finds "citus" using heap_inplace_update().
>
> > I'd also like to find a solution that fixes the case of a conflicting manual UPDATE (although certainly that's a stretch goal we may not be able to reach).
>
> It would be nice.

I expect most approaches could get there by having ExecModifyTable() arrange for the expected locking or other actions. That's analogous to how heap_update() takes care of sinval even for a manual UPDATE.

> > I wonder if there's a way for heap_inplace_update to mark the tuple header as just-updated in a way that regular heap_update could recognize. (For standard catalog updates, we'd then end up erroring in simple_heap_update, which I think is fine.) We can't update xmin, because the VACUUM callers don't have an XID; but maybe there's some other way? I'm speculating about putting a funny value into xmax, or something like that, and having heap_update check that what it sees in xmax matches what was in the tuple the update started with.
>
> Hmmm. Achieving it without an XID would be the real trick. (With an XID, we could use xl_heap_lock like heap_update() does.) Thinking out loud, what if heap_inplace_update() sets HEAP_XMAX_INVALID and xmax = TransactionIdAdvance(xmax)? Or change t_ctid in a similar way. Then regular heap_update() could complain if the field changed vs. last seen value. This feels like something to regret later in terms of limiting our ability to harness those fields for more-valuable ends or compact them away in a future page format. I can't pinpoint a specific loss, so the idea might have legs.
>
> Nontransactional data in separate tables or in new metapages smells like the right long-term state. A project wanting to reuse the tuple header bits could introduce such storage to unblock its own bit reuse.

heap_update() does not have the pre-modification xmax today, so I used t_ctid. heap_modify_tuple() preserves t_ctid, so heap_update() already has the pre-modification t_ctid in key cases. For details of how the prototype uses t_ctid, see comment at "#define InplaceCanaryOffsetNumber". The prototype doesn't prevent corruption in the following scenario, because the aborted ALTER TABLE RENAME overwrites the special t_ctid:

== session 1
drop table t;
create table t (c int);
begin;
-- in gdb, set breakpoint on heap_modify_tuple
grant select on t to public;

== session 2
alter table t add primary key (c);
begin;
alter table t rename to t2;
rollback;

== back in session 1
-- release breakpoint
-- want error (would get it w/o the begin;alter;rollback)
commit;

I'm missing how to mark the tuple in a fashion accessible to a second heap_update() after a rolled-back heap_update(). The mark needs enough bits "N" so it's implausible for 2^N inplace updates to happen between GRANT fetching the old tuple and GRANT completing heap_update(). Looking for bits that could persist across a rolled-back heap_update(), we have 3 in t_ctid, 2 in t_infomask2, and 0 in xmax. I definitely don't want to paint us into a corner by spending the t_infomask2 bits on this. Even if I did, 2^(3+2)=32 wouldn't clearly be enough inplace updates.

Is there a way to salvage the goal of fixing the bug without modifying code like ExecGrant_common()? If not, I'm inclined to pick from one of the following designs:

- Acquire ShareUpdateExclusiveLock in GRANT ((B1) from previous list). It does make GRANT more intrusive; e.g. GRANT will cancel autovacuum. I'm leaning toward this one for two reasons. First, it doesn't slow heap_update() outside of assert builds. Second, it makes the GRANT experience more like the rest of our DDL, in that concurrent DDL will make GRANT block, not fail.

- GRANT passes to heapam the fixed-size portion of the pre-modification tuple. heap_update() compares those bytes to the oldtup in shared buffers to see if an inplace update happened. (HEAD could get the bytes from a new heap_update() parameter, while back branches would need a different passing approach.)

- LockTuple(), as seen in its attached prototype. I like this least at the moment, because it changes pg_locks content without having a clear advantage over the previous option. Also, the prototype has enough FIXME markers that I expect this to get hairy before it's done. I might change my preference after further prototypes.

Does anyone have a strong preference between those? Briefly, I did consider these additional alternatives:

- Just accept the yet-rarer chance of corruption from this message's test procedure.
- Hold a buffer lock long enough to solve things.
- Remember the tuples where we overwrote a special t_ctid, and reverse the overwrite during abort processing. But I/O in the abort path sounds bad.

Thanks,
nm
I'm attaching patches implementing the LockTuple() design. It turns out we don't just lose inplace updates. We also overwrite unrelated tuples, reproduced at inplace.spec. Good starting points are README.tuplock and the heap_inplace_update_scan() header comment.

On Wed, Nov 01, 2023 at 08:09:15PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> > On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > > We could perhaps make this work by using the existing tuple-lock infrastructure, rather than inventing new locktags (a choice that spills to a lot of places including clients that examine pg_locks).
> > > I'd also like to find a solution that fixes the case of a conflicting manual UPDATE (although certainly that's a stretch goal we may not be able to reach).

I implemented that; search for ri_needLockTagTuple.

> - GRANT passes to heapam the fixed-size portion of the pre-modification tuple. heap_update() compares those bytes to the oldtup in shared buffers to see if an inplace update happened. (HEAD could get the bytes from a new heap_update() parameter, while back branches would need a different passing approach.)

This could have been fine, but ...

> - LockTuple(), as seen in its attached prototype. I like this least at the moment, because it changes pg_locks content without having a clear advantage over the previous option.

... I settled on the LockTuple() design for these reasons:

- Solves more conflicts by waiting, instead of by ERROR or by retry loops.
- Extensions wanting inplace updates don't have a big disadvantage over core code inplace updates.
- One could use this to stop "tuple concurrently updated" for pg_class rows, by using SearchSysCacheLocked1() for all pg_class DDL and making that function wait for any existing xmax like inplace_xmax_lock() does. I don't expect to write that, but it's a nice option to have.
- pg_locks shows the new lock acquisitions.

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

- AtEOXact_Inval(true) is outside the RecordTransactionCommit() critical section, but it is critical. We must not commit transactional DDL without other backends receiving an inval. (When the inplace inval becomes nontransactional, it will face the same threat.)

- Trouble is possible, I bet, if the system crashes between the inplace-update memcpy() and XLogInsert(). See the new XXX comment below the memcpy(). Might solve this by inplace update setting DELAY_CHKPT, writing WAL, and finally issuing memcpy() into the buffer.

- [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's xmin does not stop pruning, an MVCC scan in that backend can find zero tuples when one is live. This is like what all backends got in the days of SnapshotNow catalog scans. See the pgbench test suite addition. (Perhaps the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM, setting that flag later and unsetting it earlier.)

If you find decisions in this thread's patches are tied to any of those such that I should not separate those, let's discuss that. Topics in the patches that I feel are most fruitful for debate:

- This makes inplace update block if the tuple has an updater. It's like one GRANT blocking another, except an inplace updater won't get "ERROR: tuple concurrently updated" like one of the GRANTs would. I had implemented versions that avoided this blocking by mutating each tuple in the updated tuple chain. That worked, but it had corner cases bad for maintainability, listed in the inplace_xmax_lock() header comment. I'd rather accept the blocking, so hackers can rule out those corner cases. A long-running GRANT already hurts VACUUM progress more just by keeping an XID running.

- Pre-checks could make heap_inplace_update_cancel() calls rarer. Avoiding one of those avoids an exclusive buffer lock, and it avoids waiting on concurrent heap_update() if any. We'd pre-check the syscache tuple. EventTriggerOnLogin() does it that way, because the code was already in that form. I expect only vac_update_datfrozenxid() concludes !dirty enough to matter. I didn't bother with the optimization, but it would be simple.

- If any citus extension user feels like porting its heap_inplace_update() call to this, I'd value hearing about your experience.

- I paid more than my usual attention to test coverage, considering the patch stack's intensity compared to most back-patch bug fixes.

I've kept all the above topics brief; feel free to ask for more details.

Thanks,
nm
Attachments:
- inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
- inplace010-tests-v1.patch
- inplace040-waitfuncs-v1.patch
- inplace050-tests-inj-v1.patch
- inplace060-nodeModifyTable-comments-v1.patch
- inplace070-rel-locks-missing-v1.patch
- inplace080-catcache-detoast-inplace-stale-v1.patch
- inplace090-LOCKTAG_TUPLE-eoxact-v1.patch
- inplace110-successors-v1.patch
- inplace120-locktag-v1.patch
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design. It turns out we don't just lose inplace updates. We also overwrite unrelated tuples, reproduced at inplace.spec. Good starting points are README.tuplock and the heap_inplace_update_scan() header comment.

About inplace050-tests-inj-v1.patch.

+	/* Check if blocked_pid is in injection_wait(). */
+	proc = BackendPidGetProc(blocked_pid);
+	if (proc == NULL)
+		PG_RETURN_BOOL(false);	/* session gone: definitely unblocked */
+	wait_event =
+		pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
+	if (wait_event && strncmp("INJECTION_POINT(",
+							  wait_event,
+							  strlen("INJECTION_POINT(")) == 0)
+		PG_RETURN_BOOL(true);

Hmm. I am not sure that this is the right interface for the job, because this is not only related to injection points but to the monitoring of one or more wait events when running a permutation step. Perhaps this is something that should be linked to the spec files with some property area listing the wait events we're expected to wait on instead when running a step that we know will wait?
--
Michael
On Mon, May 13, 2024 at 04:59:59PM +0900, Michael Paquier wrote:
> About inplace050-tests-inj-v1.patch.
>
> +	/* Check if blocked_pid is in injection_wait(). */
> +	proc = BackendPidGetProc(blocked_pid);
> +	if (proc == NULL)
> +		PG_RETURN_BOOL(false);	/* session gone: definitely unblocked */
> +	wait_event =
> +		pgstat_get_wait_event(UINT32_ACCESS_ONCE(proc->wait_event_info));
> +	if (wait_event && strncmp("INJECTION_POINT(",
> +							  wait_event,
> +							  strlen("INJECTION_POINT(")) == 0)
> +		PG_RETURN_BOOL(true);
>
> Hmm. I am not sure that this is the right interface for the job, because this is not only related to injection points but to the monitoring of one or more wait events when running a permutation step.

Could you say more about that? Permutation steps don't monitor wait events today. This patch would be the first instance of that.

> Perhaps this is something that should be linked to the spec files with some property area listing the wait events we're expected to wait on instead when running a step that we know will wait?

The spec syntax doesn't distinguish contention types at all. The isolation tester's needs are limited to distinguishing:

(a) process is waiting on another test session
(b) process is waiting on automatic background activity (autovacuum, mainly)

Automatic background activity doesn't make a process enter or leave injection_wait(), so all injection point wait events fall in (a). (The tester ignores (b), since those clear up without intervention. Failing to ignore them, as the tester did long ago, made output unstable.)
On Sun, May 12, 2024 at 7:29 PM Noah Misch <noah@leadboat.com> wrote:
> - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's xmin does not stop pruning, an MVCC scan in that backend can find zero tuples when one is live. This is like what all backends got in the days of SnapshotNow catalog scans. See the pgbench test suite addition. (Perhaps the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM, setting that flag later and unsetting it earlier.)

Are you saying that this is a problem already, or that the patch causes it to start happening? If it's the former, that's horrible. If it's the latter, I'd say that is a fatal defect.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, May 13, 2024 at 03:53:08PM -0400, Robert Haas wrote:
> On Sun, May 12, 2024 at 7:29 PM Noah Misch <noah@leadboat.com> wrote:
> > - [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's xmin does not stop pruning, an MVCC scan in that backend can find zero tuples when one is live. This is like what all backends got in the days of SnapshotNow catalog scans. See the pgbench test suite addition. (Perhaps the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM, setting that flag later and unsetting it earlier.)
>
> Are you saying that this is a problem already, or that the patch causes it to start happening? If it's the former, that's horrible.

The former.
On Sun, May 12, 2024 at 04:29:23PM -0700, Noah Misch wrote:
> I'm attaching patches implementing the LockTuple() design.

Starting 2024-06-10, I plan to push the first seven of the ten patches:

inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
inplace010-tests-v1.patch
inplace040-waitfuncs-v1.patch
inplace050-tests-inj-v1.patch
inplace060-nodeModifyTable-comments-v1.patch
  Those five just deal in tests, test infrastructure, and comments.
inplace070-rel-locks-missing-v1.patch
  Main risk is new DDL deadlocks.
inplace080-catcache-detoast-inplace-stale-v1.patch
  If it fails to fix the bug it targets, I expect it's a no-op rather than breaking things.

I'll leave the last three of the ten needing review. Those three are beyond my skill to self-certify.
On Wed, Jun 5, 2024 at 2:17 PM Noah Misch <noah@leadboat.com> wrote:
> Starting 2024-06-10, I plan to push the first seven of the ten patches:
>
> inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
> inplace010-tests-v1.patch
> inplace040-waitfuncs-v1.patch
> inplace050-tests-inj-v1.patch
> inplace060-nodeModifyTable-comments-v1.patch
> Those five just deal in tests, test infrastructure, and comments.
> inplace070-rel-locks-missing-v1.patch
> Main risk is new DDL deadlocks.
> inplace080-catcache-detoast-inplace-stale-v1.patch
> If it fails to fix the bug it targets, I expect it's a no-op rather than breaking things.
>
> I'll leave the last three of the ten needing review. Those three are beyond my skill to self-certify.

It's not this patch set's fault, but I'm not very pleased to see that the injection point wait events have been shoehorned into the "Extension" category - which they are not - instead of being a new wait_event_type. That would have avoided the ugly wait-event naming pattern, inconsistent with everything else, introduced by inplace050-tests-inj-v1.patch.

I think that the comments and commit messages in this patch set could, in some places, use improvement. For instance, inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of comments, which makes it hard to see what actually changed, and the commit message doesn't tell you, either. A good bit of it seems to be changing "a view" to "a view INSTEAD OF trigger" or "a view having an INSTEAD OF trigger," but the reasoning behind that change is not spelled out anywhere. The reader is left to guess what the other case is and why the same principles don't apply to it. I don't doubt that the new comments are more correct than the old ones, but I expect future patch authors to have difficulty maintaining that state of affairs.

Similarly, inplace070-rel-locks-missing-v1.patch adds no comments. IMHO, the commit message also isn't very informative. It disclaims knowledge of what bug it's fixing, while at the same time leaving the reader to figure out for themselves how the behavior has changed. Consequently, I expect writing the release notes for a release including this patch to be difficult: "We added some locks that block ... something ... in some circumstances ... to prevent ... something." It's not really the job of the release note author to fill in those blanks, but rather of the patch author or committer. I don't want to overburden the act of fixing bugs, but I just feel like more explanation is needed here. When I see for example that we're adding a lock acquisition to the end of heap_create(), I can't help but wonder if it's really true that we don't take a lock on a just-created relation today. I'm certainly under the impression that we lock newly-created, uncommitted relations, and a quick test seems to confirm that. I don't quite know whether that happens, but evidently this call is guarding against something more subtle than a categorical failure to lock a relation on creation so I think there should be a comment explaining what that thing is.

It's also quite surprising that SetRelationHasSubclass() says "take X lock before calling" and 2 of 4 callers just don't. I guess that's how it is. But shouldn't we then have an assertion inside that function to guard against future mistakes? If the reason why we failed to add this initially is discernible from the commit messages that introduced the bug, it would be nice to mention what it seems to have been; if not, it would at least be nice to mention the offending commit(s). I'm also a bit worried that this is going to cause deadlocks, but I suppose if it does, that's still better than the status quo.

IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation instead of IsInplaceUpdateOid.

inplace080-catcache-detoast-inplace-stale-v1.patch seems like another place where spelling out the rationale in more detail would be helpful to future readers; for instance, the commit message says that PgDatabaseToastTable is the only one affected, but it doesn't say why the others are not, or why this one is. The lengthy comment in CatalogCacheCreateEntry is also difficult to correlate with the code which follows. I can't guess whether the two cases called out in the comment always needed to be handled and were handled save only for in-place updates, and thus the comment changes were simply taking the opportunity to elaborate on the existing comments; or whether one of those cases is preexisting and the other arises from the desire to handle inplace updates. It could be helpful to mention relevant identifiers from the code in the comment text e.g. "systable_recheck_tuple detects ordinary updates by noting changes to the tuple's visibility information, while the equalTuple() case detects inplace updates."

IMHO, this patch set underscores the desirability of removing in-place update altogether. That sounds difficult and not back-patchable, but I can't classify what this patch set does as anything better than grotty hacks to work around serious design deficiencies. That is not a vote against these patches: I see no better way forward. Nonetheless, I dislike the lack of better options.

I have done only cursory review of the last two patches and don't feel I'm in a place to certify them, at least not now.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> It's not this patch set's fault, but I'm not very pleased to see that the injection point wait events have been shoehorned into the "Extension" category - which they are not - instead of being a new wait_event_type. That would have avoided the ugly wait-event naming pattern, inconsistent with everything else, introduced by inplace050-tests-inj-v1.patch.

Not sure to agree with that. The set of core backend APIs supporting injection points have nothing to do with wait events. The library attached to one or more injection points *may* decide to use a wait event like what the wait/wakeup calls in modules/injection_points do, but that's entirely optional. These rely on custom wait events, plugged into the Extension category as the code run is itself in an extension. I am not arguing against the point that it may be interesting to plug in custom wait event categories, but the current design of wait events makes that much harder than what core is currently able to handle, and I am not sure that this brings much at the end as long as the wait event strings can be customized.

I've voiced upthread concerns over the naming enforced by the patch and the way it plugs the namings into the isolation functions, by the way.
--
Michael
Attachment
On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier <michael@paquier.xyz> wrote: > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote: > > It's not this patch set's fault, but I'm not very pleased to see that > > the injection point wait events have been shoehorned into the > > "Extension" category - which they are not - instead of being a new > > wait_event_type. That would have avoided the ugly wait-event naming > > pattern, inconsistent with everything else, introduced by > > inplace050-tests-inj-v1.patch. > > Not sure to agree with that. The set of core backend APIs supporting > injection points have nothing to do with wait events. The library > attached to one or more injection points *may* decide to use a wait > event like what the wait/wakeup calls in modules/injection_points do, > but that's entirely optional. These rely on custom wait events, > plugged into the Extension category as the code run is itself in an > extension. I am not arguing against the point that it may be > interesting to plug in custom wait event categories, but the current > design of wait events makes that much harder than what core is > currently able to handle, and I am not sure that this brings much at > the end as long as the wait event strings can be customized. > > I've voiced upthread concerns over the naming enforced by the patch > and the way it plugs the namings into the isolation functions, by the > way. I think the core code should provide an "Injection Point" wait event type and let extensions add specific wait events there, just like you did for "Extension". Then this ugly naming would go away. As I see it, "Extension" is only supposed to be used as a catch-all when we have no other information, but here we do. If we refuse to use the wait_event_type field to categorize waits, then people are going to have to find some other way to get that data into the system, as Noah has done. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote: > On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote: > > > It's not this patch set's fault, but I'm not very pleased to see that > > > the injection point wait events have been shoehorned into the > > > "Extension" category - which they are not - instead of being a new > > > wait_event_type. That would have avoided the ugly wait-event naming > > > pattern, inconsistent with everything else, introduced by > > > inplace050-tests-inj-v1.patch. > > > > Not sure to agree with that. The set of core backend APIs supporting > > injection points have nothing to do with wait events. The library > > attached to one or more injection points *may* decide to use a wait > > event like what the wait/wakeup calls in modules/injection_points do, > > but that's entirely optional. These rely on custom wait events, > > plugged into the Extension category as the code run is itself in an > > extension. I am not arguing against the point that it may be > > interesting to plug in custom wait event categories, but the current > > design of wait events makes that much harder than what core is > > currently able to handle, and I am not sure that this brings much at > > the end as long as the wait event strings can be customized. > > > > I've voiced upthread concerns over the naming enforced by the patch > > and the way it plugs the namings into the isolation functions, by the > > way. > > I think the core code should provide an "Injection Point" wait event > type and let extensions add specific wait events there, just like you > did for "Extension". Michael, could you accept the core code offering that, or not? If so, I am content to implement that. If not, for injection point wait events, I have just one priority. The isolation tester already detects lmgr locks without the test writer teaching it about each lock individually. I want it to have that same capability for injection points. Do you think we can find something everyone can accept, having that property? These wait events show up in tests only, and I'm happy to make the cosmetics be anything compatible with that detection ability.
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote: > On Wed, Jun 5, 2024 at 2:17 PM Noah Misch <noah@leadboat.com> wrote: > > Starting 2024-06-10, I plan to push the first seven of the ten patches: > > > > inplace005-UNEXPECTEDPASS-tap-meson-v1.patch > > inplace010-tests-v1.patch > > inplace040-waitfuncs-v1.patch > > inplace050-tests-inj-v1.patch > > inplace060-nodeModifyTable-comments-v1.patch > > Those five just deal in tests, test infrastructure, and comments. > > inplace070-rel-locks-missing-v1.patch > > Main risk is new DDL deadlocks. > > inplace080-catcache-detoast-inplace-stale-v1.patch > > If it fails to fix the bug it targets, I expect it's a no-op rather than > > breaking things. > > > > I'll leave the last three of the ten needing review. Those three are beyond > > my skill to self-certify. > > It's not this patch set's fault, but I'm not very pleased to see that > the injection point wait events have been shoehorned into the > "Extension" category I've replied on that branch of the thread. > I think that the comments and commit messages in this patch set could, > in some places, use improvement. For instance, > inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of > comments, which makes it hard to see what actually changed, and the > commit message doesn't tell you, either. A good bit of it seems to be > changing "a view" to "a view INSTEAD OF trigger" or "a view having an > INSTEAD OF trigger," but the reasoning behind that change is not > spelled out anywhere. The reader is left to guess what the other case > is and why the same principles don't apply to it. I don't doubt that > the new comments are more correct than the old ones, but I expect > future patch authors to have difficulty maintaining that state of > affairs. The two kinds are trigger-updatable views and auto-updatable views. I've added sentences about that to the nodeModifyTable.c header comment. One could argue for dropping the INSTEAD OF comment changes outside of the header. > Similarly, inplace070-rel-locks-missing-v1.patch adds no comments. > IMHO, the commit message also isn't very informative. It disclaims > knowledge of what bug it's fixing, while at the same time leaving the > reader to figure out for themselves how the behavior has changed. > Consequently, I expect writing the release notes for a release > including this patch to be difficult: "We added some locks that block > ... something ... in some circumstances ... to prevent ... something." > It's not really the job of the release note author to fill in those > blanks, but rather of the patch author or committer. I don't want to I had been thinking release notes should just say "Add missing DDL lock acquisitions". One can cure a breach of our locking standards without proving some specific bad outcome. However, one could counter that commands like GRANT follow a different standard, and perhaps SetRelationHasSubclass() should use the GRANT standard. Hence, I researched the bugs this fixes and split inplace070-rel-locks-missing into three patches: 1. [inplace065-lock-SequenceChangePersistence] Lock in SequenceChangePersistence(), where the omission can lose nextval() increments of the sequence. 2. [inplace071-lock-SetRelationHasSubclass] Lock in SetRelationHasSubclass(). This one has only minor benefits; see the new commit message. A fair alternative would be tuple-level locking in inplace120-locktag, like that patch adds to GRANT. That might avoid some deadlocks. 
I feel like the minor benefits justify the way I chose, but it's a weak preference. 3. [inplace075-lock-heap_create] Add to heap creation: > overburden the act of fixing bugs, but I just feel like more > explanation is needed here. When I see for example that we're adding a > lock acquisition to the end of heap_create(), I can't help but wonder > if it's really true that we don't take a lock on a just-created > relation today. I'm certainly under the impression that we lock > newly-created, uncommitted relations, and a quick test seems to > confirm that. I don't quite know whether that happens, but evidently > this call is guarding against something more subtle than a categorical > failure to lock a relation on creation so I think there should be a > comment explaining what that thing is. I've covered that in the new log message. To lock as early as possible, I've moved this up a layer, to just after relid assignment. One could argue this change belongs in inplace120 rather than its own patch, since it's only here to eliminate a harmless exception to the rule inplace120 asserts. I've removed the update_relispartition() that appeared in inplace070-rel-locks-missing-v1.patch. Only an older, unpublished draft of the rules (that inplace110-successors adds to README.tuplock) required that lock. The lock might be worthwhile for avoiding "tuple concurrently updated", but it's out of scope for $SUBJECT. > It's also quite surprising that SetRelationHasSubclass() says "take X > lock before calling" and 2 of 4 callers just don't. I guess that's how > it is. But shouldn't we then have an assertion inside that function to > guard against future mistakes? If the reason why we failed to add this Works for me. Done. I've moved the LockHeldByMe() change from inplace110-successors to this patch, since the assertion wants it. > initially is discernible from the commit messages that introduced the > bug, it would be nice to mention what it seems to have been; if not, > it would at least be nice to mention the offending commit(s). I'm also Done. > a bit worried that this is going to cause deadlocks, but I suppose if > it does, that's still better than the status quo. > > IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation > instead of IsInplaceUpdateOid. Fixed. > inplace080-catcache-detoast-inplace-stale-v1.patch seems like another > place where spelling out the rationale in more detail would be helpful > to future readers; for instance, the commit message says that > PgDatabaseToastTable is the only one affected, but it doesn't say why > the others are not, or why this one is. The lengthy comment in I've updated the commit message to answer that. > CatalogCacheCreateEntry is also difficult to correlate with the code > which follows. I can't guess whether the two cases called out in the > comment always needed to be handled and were handled save only for > in-place updates, and thus the comment changes were simply taking the > opportunity to elaborate on the existing comments; or whether one of > those cases is preexisting and the other arises from the desire to > handle inplace updates. It could be helpful to mention relevant > identifiers from the code in the comment text e.g. > "systable_recheck_tuple detects ordinary updates by noting changes to > the tuple's visibility information, while the equalTuple() case > detects inplace updates." The patch was elaborating on existing comments. Reading the patch again today, the elaboration no longer feels warranted. 
Hence, I've rewritten that comment addition. I've included identifiers, and the patch no longer adds comment material orthogonal to inplace updates. Thanks, nm
Attachment
- inplace005-UNEXPECTEDPASS-tap-meson-v2.patch
- inplace010-tests-v2.patch
- inplace040-waitfuncs-v2.patch
- inplace050-tests-inj-v2.patch
- inplace060-nodeModifyTable-comments-v2.patch
- inplace065-lock-SequenceChangePersistence-v2.patch
- inplace071-lock-SetRelationHasSubclass-v2.patch
- inplace075-lock-heap_create-v2.patch
- inplace080-catcache-detoast-inplace-stale-v2.patch
- inplace090-LOCKTAG_TUPLE-eoxact-v2.patch
- inplace110-successors-v2.patch
- inplace120-locktag-v2.patch
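To make the SetRelationHasSubclass() point above concrete, here is a rough
sketch of what a caller-holds-lock assertion could look like, assuming the
stock LockHeldByMe(LOCKTAG *, LOCKMODE) interface (which checks one exact lock
mode, so the real patch may want an "or stronger" variant). This is purely
illustrative, not the patch itself:

#ifdef USE_ASSERT_CHECKING
	{
		LOCKTAG		tag;

		/* relationId: the relation whose pg_class row is being updated */
		SET_LOCKTAG_RELATION(tag, MyDatabaseId, relationId);
		Assert(LockHeldByMe(&tag, ShareUpdateExclusiveLock));
	}
#endif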
On Mon, Jun 10, 2024 at 07:19:27PM -0700, Noah Misch wrote: > On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote: >> I think the core code should provide an "Injection Point" wait event >> type and let extensions add specific wait events there, just like you >> did for "Extension". > > Michael, could you accept the core code offering that, or not? If so, I am > content to implement that. If not, for injection point wait events, I have > just one priority. The isolation tester already detects lmgr locks without > the test writer teaching it about each lock individually. I want it to have > that same capability for injection points. Do you think we can find something > everyone can accept, having that property? These wait events show up in tests > only, and I'm happy to make the cosmetics be anything compatible with that > detection ability. Adding a wait event class for injection point is an interesting suggestion that would simplify the detection in the isolation function quite a bit. Are you sure that this is something that would be fit for v17 material? TBH, I am not sure. At the end, the test coverage has the highest priority and the bugs you are addressing are complex enough that isolation tests of this level are a necessity, so I don't object to what inplace050-tests-inj-v2.patch introduces with the naming dependency for the time being on HEAD. I'll just adapt and live with that depending on what I deal with, while trying to improve HEAD later on. I'm still wondering if there is something that could be more elegant than a dedicated class for injection points, but I cannot think about something that would be better for isolation tests on top of my head. If there is something I can think of, I'll just go and implement it :) -- Michael
Attachment
Hello, everyone.
I am not sure, but I think that issue may be related to the issue described in https://www.postgresql.org/message-id/CANtu0ojXmqjmEzp-%3DaJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg%40mail.gmail.com
It looks like REINDEX CONCURRENTLY could interfere with ON CONFLICT UPDATE in some strange way.
Best regards,
Mikhail.
On Tue, Jun 11, 2024 at 01:37:21PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 07:19:27PM -0700, Noah Misch wrote:
> > On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
> >> I think the core code should provide an "Injection Point" wait event
> >> type and let extensions add specific wait events there, just like you
> >> did for "Extension".
> >
> > Michael, could you accept the core code offering that, or not? If so, I am
> > content to implement that. If not, for injection point wait events, I have
> > just one priority. The isolation tester already detects lmgr locks without
> > the test writer teaching it about each lock individually. I want it to have
> > that same capability for injection points. Do you think we can find something
> > everyone can accept, having that property? These wait events show up in tests
> > only, and I'm happy to make the cosmetics be anything compatible with that
> > detection ability.
>
> Adding a wait event class for injection point is an interesting
> suggestion that would simplify the detection in the isolation function
> quite a bit. Are you sure that this is something that would be fit
> for v17 material? TBH, I am not sure.

If I were making a list of changes always welcome post-beta, it wouldn't
include adding wait event types. But I don't hesitate to add one if it
unblocks a necessary test for a bug present in all versions.

> At the end, the test coverage has the highest priority and the bugs
> you are addressing are complex enough that isolation tests of this
> level are a necessity, so I don't object to what
> inplace050-tests-inj-v2.patch introduces with the naming dependency
> for the time being on HEAD. I'll just adapt and live with that
> depending on what I deal with, while trying to improve HEAD later on.

Here's what I'm reading for each person's willingness to tolerate each option:

STRATEGY                        | Paquier | Misch | Haas
--------------------------------------------------------
new "Injection Point" wait type | maybe   | yes   | yes
INJECTION_POINT(...) naming     | yes     | yes   | unknown
isolation spec says event names | yes     | no    | unknown

Corrections and additional strategy lines welcome. Robert, how do you judge
the lines where I've listed you as "unknown"?

> I'm still wondering if there is something that could be more elegant
> than a dedicated class for injection points, but I cannot think about
> something that would be better for isolation tests on top of my head.
> If there is something I can think of, I'll just go and implement it :)

I once considered changing them to use advisory lock waits instead of
ConditionVariableSleep(), but I recall that was worse from the perspective of
injection points in critical sections.
On Wed, Jun 12, 2024 at 1:54 PM Noah Misch <noah@leadboat.com> wrote:
> If I were making a list of changes always welcome post-beta, it wouldn't
> include adding wait event types. But I don't hesitate to add one if it
> unblocks a necessary test for a bug present in all versions.

However, injection points themselves are not present in all versions, so even
if we invent a new wait-event type, we'll have difficulty testing older
versions, unless we're planning to back-patch all of that infrastructure,
which I assume we aren't.

Personally, I think the fact that injection point wait events were put under
Extension is a design mistake that should be corrected before 17 is out of
beta.

> Here's what I'm reading for each person's willingness to tolerate each option:
>
> STRATEGY                        | Paquier | Misch | Haas
> --------------------------------------------------------
> new "Injection Point" wait type | maybe   | yes   | yes
> INJECTION_POINT(...) naming     | yes     | yes   | unknown
> isolation spec says event names | yes     | no    | unknown
>
> Corrections and additional strategy lines welcome. Robert, how do you judge
> the lines where I've listed you as "unknown"?

I'd tolerate INJECTION_POINT() if we had no other option but I think it's
clearly inferior. Does the last line refer to putting the specific wait event
names in the isolation spec file? If so, I'd also be fine with that.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jun 12, 2024 at 02:08:31PM -0400, Robert Haas wrote:
> On Wed, Jun 12, 2024 at 1:54 PM Noah Misch <noah@leadboat.com> wrote:
> > If I were making a list of changes always welcome post-beta, it wouldn't
> > include adding wait event types. But I don't hesitate to add one if it
> > unblocks a necessary test for a bug present in all versions.
>
> However, injection points themselves are not present in all versions,
> so even if we invent a new wait-event type, we'll have difficulty
> testing older versions, unless we're planning to back-patch all of
> that infrastructure, which I assume we aren't.

Right. We could put the injection point tests in v18 only instead of v17+v18.
I feel that would be an overreaction to a dispute about names that show up
only in tests. Even so, I could accept that.

> Personally, I think the fact that injection point wait events were put
> under Extension is a design mistake that should be corrected before 17
> is out of beta.

Works for me. I don't personally have a problem with the use of Extension,
since it is a src/test/modules extension creating them.

> > Here's what I'm reading for each person's willingness to tolerate each option:
> >
> > STRATEGY                        | Paquier | Misch | Haas
> > --------------------------------------------------------
> > new "Injection Point" wait type | maybe   | yes   | yes
> > INJECTION_POINT(...) naming     | yes     | yes   | unknown
> > isolation spec says event names | yes     | no    | unknown
> >
> > Corrections and additional strategy lines welcome. Robert, how do you judge
> > the lines where I've listed you as "unknown"?
>
> I'd tolerate INJECTION_POINT() if we had no other option but I think
> it's clearly inferior. Does the last line refer to putting the
> specific wait event names in the isolation spec file? If so, I'd also
> be fine with that.

Yes, the last line does refer to that. Updated table:

STRATEGY                        | Paquier | Misch | Haas
--------------------------------------------------------
new "Injection Point" wait type | maybe   | yes   | yes
INJECTION_POINT(...) naming     | yes     | yes   | no
isolation spec says event names | yes     | no    | yes

I find that's adequate support for the first line. If there are no objections
in the next 24hr, I will implement that.
On Wed, Jun 12, 2024 at 03:02:43PM +0200, Michail Nikolaev wrote: > I am not sure, but I think that issue may be related to the issue described > in > https://www.postgresql.org/message-id/CANtu0ojXmqjmEzp-%3DaJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg%40mail.gmail.com > > It looks like REINDEX CONCURRENTLY could interfere with ON CONFLICT UPDATE > in some strange way. Can you say more about the connection you see between $SUBJECT and that? That looks like a valid report of an important bug, but I'm not following the potential relationship to $SUBJECT. On your other thread, it would be useful to see stack traces from the high-CPU processes once the live lock has ended all query completion.
Hello!
> looks like a valid report of an important bug, but I'm not following the
> potential relationship to $SUBJECT.
I was guided by the following logic:
* A pg_class race condition can cause table indexes to look stale.
* REINDEX updates indexes
* errors can be explained by different backends using different arbiter indexes
> On your other thread, it would be useful to see stack traces from the high-CPU
> processes once the live lock has ended all query completion.
I'll do that.
Best regards,
Mikhail.
On Wed, Jun 12, 2024 at 12:32:23PM -0700, Noah Misch wrote: > On Wed, Jun 12, 2024 at 02:08:31PM -0400, Robert Haas wrote: >> Personally, I think the fact that injection point wait events were put >> under Extension is a design mistake that should be corrected before 17 >> is out of beta. Well, isolation tests and the way to wait for specific points in them is something I've thought about when working on the initial injpoint infrastructure, but all my ideas went down to the fact that this is not specific to injection points: I've also wanted to be able to cause an isolation to wait for a specific event (class,name). A hardcoded sleep is an example. Even if I discourage anything like that in the in-core tests because they're slow on fast machines and can be unreliable on slow machines, it is a fact that they are used by out-of-core code and that extension developers find them acceptable. > Works for me. I don't personally have a problem with the use of Extension, > since it is a src/test/modules extension creating them. That's the original reason why Extension has been used in this case, because the points are assigned in an extension. > Yes, the last line does refer to that. Updated table: > > STRATEGY | Paquier | Misch | Haas > -------------------------------------------------------- > new "Injection Point" wait type | maybe | yes | yes > INJECTION_POINT(...) naming | yes | yes | no > isolation spec says event names | yes | no | yes > > I find that's adequate support for the first line. If there are no objections > in the next 24hr, I will implement that. OK. That sounds like a consensus to me, useful enough for the cases at hand. -- Michael
Attachment
On Wed, Jun 12, 2024 at 10:02:00PM +0200, Michail Nikolaev wrote: > > Can you say more about the connection you see between $SUBJECT and that? > That > > looks like a valid report of an important bug, but I'm not following the > > potential relationship to $SUBJECT. > > I was guided by the following logic: > * A pg_class race condition can cause table indexes to look stale. > * REINDEX updates indexes > * errors can be explained by different backends using different arbiter > indexes Got it. The race condition of $SUBJECT involves inplace updates, and the wrong content becomes permanent. Hence, I suspect they're unrelated.
On Mon, Jun 10, 2024 at 07:45:25PM -0700, Noah Misch wrote: > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote: > > It's not this patch set's fault, but I'm not very pleased to see that > > the injection point wait events have been shoehorned into the > > "Extension" category > > I've replied on that branch of the thread. I think the attached covers all comments to date. I gave everything v3, but most patches have just a no-conflict rebase vs. v2. The exceptions are inplace031-inj-wait-event (implements the holding from that branch of the thread) and inplace050-tests-inj (updated to cooperate with inplace031). Much of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the infrastructure common to the two custom wait event types. Thanks, nm
Attachment
- inplace005-UNEXPECTEDPASS-tap-meson-v3.patch
- inplace010-tests-v3.patch
- inplace031-inj-wait-event-v3.patch
- inplace040-waitfuncs-v3.patch
- inplace050-tests-inj-v3.patch
- inplace060-nodeModifyTable-comments-v3.patch
- inplace065-lock-SequenceChangePersistence-v3.patch
- inplace071-lock-SetRelationHasSubclass-v3.patch
- inplace075-lock-heap_create-v3.patch
- inplace080-catcache-detoast-inplace-stale-v3.patch
- inplace090-LOCKTAG_TUPLE-eoxact-v3.patch
- inplace110-successors-v3.patch
- inplace120-locktag-v3.patch
On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> I think the attached covers all comments to date. I gave everything v3, but
> most patches have just a no-conflict rebase vs. v2. The exceptions are
> inplace031-inj-wait-event (implements the holding from that branch of the
> thread) and inplace050-tests-inj (updated to cooperate with inplace031). Much
> of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> infrastructure common to the two custom wait event types.

Looking at inplace031-inj-wait-event..

The comment at the top of GetWaitEventCustomNames() requires an update, still
mentioning extensions.

GetWaitEventCustomIdentifier() is incorrect, and should return
"InjectionPoint" in the default case of this class name, no? I would just
pass the classID to GetWaitEventCustomIdentifier().

It is suboptimal to have pg_get_wait_events() do two scans of
WaitEventCustomHashByName. Wouldn't it be better to do a single scan,
returning a set of (class_name,event_name) fed to the tuplestore of this SRF?

 uint32
 WaitEventExtensionNew(const char *wait_event_name)
 {
+	return WaitEventCustomNew(PG_WAIT_EXTENSION, wait_event_name);
+}
+
+uint32
+WaitEventInjectionPointNew(const char *wait_event_name)
+{
+	return WaitEventCustomNew(PG_WAIT_INJECTIONPOINT, wait_event_name);
+}

Hmm. The advantage of two routines is that it is possible to control the
class IDs allowed to use the custom wait events. Shouldn't the second routine
be documented in xfunc.sgml?

wait_event_names.txt also needs tweaks, in the shape of a new class name for
the new class "InjectionPoint" so as it can be documented for its default
case. That's a fallback if an event ID cannot be found, which should not be
the case, still that's more correct than showing "Extension" for all class
IDs covered by custom wait events.

--
Michael
Attachment
On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote: > Looking at inplace031-inj-wait-event.. > > The comment at the top of GetWaitEventCustomNames() requires an > update, still mentioning extensions. Thanks. Fixed locally. > GetWaitEventCustomIdentifier() is incorrect, and should return > "InjectionPoint" in the default case of this class name, no? I intentionally didn't provide a default event ID for InjectionPoint. PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing else. For this second custom type, it's needless complexity. The value 0x0B000000U won't just show up like PG_WAIT_EXTENSION does. GetLWLockIdentifier() also has no default case. How do you see it? > I would > just pass the classID to GetWaitEventCustomIdentifier(). As you say, that would allow eventId==0 to raise "could not find custom wait event" for PG_WAIT_INJECTIONPOINT instead of wrongly returning "Extension". Even if 0x0B000000U somehow does show up, having pg_stat_activity report "Extension" instead of an error, in a developer test run, feels unimportant to me. > It is suboptimal to have pg_get_wait_events() do two scans of > WaitEventCustomHashByName. Wouldn't it be better to do a single scan, > returning a set of (class_name,event_name) fed to the tuplestore of > this SRF? Micro-optimization of pg_get_wait_events() doesn't matter. I did consider that or pushing more of the responsibility into wait_events.c, but I considered it on code repetition grounds, not performance grounds. > uint32 > WaitEventExtensionNew(const char *wait_event_name) > { > + return WaitEventCustomNew(PG_WAIT_EXTENSION, wait_event_name); > +} > + > +uint32 > +WaitEventInjectionPointNew(const char *wait_event_name) > +{ > + return WaitEventCustomNew(PG_WAIT_INJECTIONPOINT, wait_event_name); > +} > > Hmm. The advantage of two routines is that it is possible to control > the class IDs allowed to use the custom wait events. Shouldn't the > second routine be documented in xfunc.sgml? The patch added to xfunc.sgml an example of using it. I'd be more inclined to delete the WaitEventExtensionNew() docbook documentation than to add its level of detail for WaitEventInjectionPointNew(). We don't have that kind of documentation for most extension-facing C functions.
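For a sense of what the proposed API means for extension code, here is a rough
sketch of how a test module could register and report both kinds of custom
wait events. The event names and the surrounding control flow are invented for
illustration, and error handling is omitted:

	static uint32 extension_we = 0;
	static uint32 injection_we = 0;

	/* register once per backend and cache the IDs */
	if (extension_we == 0)
		extension_we = WaitEventExtensionNew("MyExtensionWait");
	if (injection_we == 0)
		injection_we = WaitEventInjectionPointNew("my-injection-point");

	/* both report through the ordinary wait-event machinery */
	pgstat_report_wait_start(injection_we);
	/* ... block here until another session wakes us up ... */
	pgstat_report_wait_end();

With a separate "InjectionPoint" class, the isolation tester (or a monitoring
query) can recognize such waits by wait_event_type alone instead of matching
individual event names.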
On Thu, Jun 13, 2024 at 07:42:25PM -0700, Noah Misch wrote: > On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote: >> GetWaitEventCustomIdentifier() is incorrect, and should return >> "InjectionPoint" in the default case of this class name, no? > > I intentionally didn't provide a default event ID for InjectionPoint. > PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing > else. For this second custom type, it's needless complexity. The value > 0x0B000000U won't just show up like PG_WAIT_EXTENSION does. > GetLWLockIdentifier() also has no default case. How do you see it? I would add a default for consistency as this is just a few extra lines, but if you feel strongly about that, I'm OK as well. It makes a bit easier the detection of incorrect wait event numbers set incorrectly in extensions depending on the class wanted. > The patch added to xfunc.sgml an example of using it. I'd be more inclined to > delete the WaitEventExtensionNew() docbook documentation than to add its level > of detail for WaitEventInjectionPointNew(). We don't have that kind of > documentation for most extension-facing C functions. It's one of the areas where I think that we should have more documentation, not less of it, so I'd rather keep it and maintaining it is not really a pain (?). The backend gets complicated enough these days that limiting what developers have to guess on their own is a better long-term approach because the Postgres out-of-core ecosystem is expanding a lot (aka have also in-core documentation for hooks, even if there's been a lot of reluctance historically about having them). -- Michael
Attachment
On Sun, Jun 16, 2024 at 09:28:05AM +0900, Michael Paquier wrote: > On Thu, Jun 13, 2024 at 07:42:25PM -0700, Noah Misch wrote: > > On Fri, Jun 14, 2024 at 09:58:59AM +0900, Michael Paquier wrote: > >> GetWaitEventCustomIdentifier() is incorrect, and should return > >> "InjectionPoint" in the default case of this class name, no? > > > > I intentionally didn't provide a default event ID for InjectionPoint. > > PG_WAIT_EXTENSION needs a default case for backward compatibility, if nothing > > else. For this second custom type, it's needless complexity. The value > > 0x0B000000U won't just show up like PG_WAIT_EXTENSION does. > > GetLWLockIdentifier() also has no default case. How do you see it? > > I would add a default for consistency as this is just a few extra > lines, but if you feel strongly about that, I'm OK as well. It makes > a bit easier the detection of incorrect wait event numbers set > incorrectly in extensions depending on the class wanted. It would be odd to detect exactly 0x0B000000U and not other invalid inputs, like 0x0A000001U where only 0x0B000001U is valid. I'm attaching roughly what it would take. Shall I squash this into inplace031? The thing I feel strongly about here is keeping focus on fixing $SUBJECT bugs that are actually corrupting data out there. I think we should all limit our interest in the verbiage of strings that appear only when running developer tests, especially when $SUBJECT is a bug fix. When the string appears only after C code passes invalid input to other C code, it matters even less. > > The patch added to xfunc.sgml an example of using it. I'd be more inclined to > > delete the WaitEventExtensionNew() docbook documentation than to add its level > > of detail for WaitEventInjectionPointNew(). We don't have that kind of > > documentation for most extension-facing C functions. > > It's one of the areas where I think that we should have more > documentation, not less of it, so I'd rather keep it and maintaining > it is not really a pain (?). The backend gets complicated enough > these days that limiting what developers have to guess on their own is > a better long-term approach because the Postgres out-of-core ecosystem > is expanding a lot (aka have also in-core documentation for hooks, > even if there's been a lot of reluctance historically about having > them). [getting deeply off topic -- let's move this to another thread if it needs to expand] I like reducing the need to guess. So far in this inplace update project (this thread plus postgr.es/m/20240615223718.42.nmisch@google.com), three patches just fix comments. Even comments carry quite a price, but I value them. When we hand-maintain documentation of a C function in both its header comment and another place, I get skeptical about whether hackers (including myself) will actually keep them in sync and skeptical of the incremental value of maintaining the second version.
Attachment
On Sun, Jun 16, 2024 at 07:07:08AM -0700, Noah Misch wrote: > It would be odd to detect exactly 0x0B000000U and not other invalid inputs, > like 0x0A000001U where only 0x0B000001U is valid. I'm attaching roughly what > it would take. Shall I squash this into inplace031? Agreed that merging both together is cleaner. Moving the event class into the key of WaitEventCustomEntryByInfo leads to a more consistent final result. > The thing I feel strongly about here is keeping focus on fixing $SUBJECT bugs > that are actually corrupting data out there. Agreed to focus on that first. -- Michael
Attachment
On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote:
> On Mon, Jun 10, 2024 at 07:45:25PM -0700, Noah Misch wrote:
> > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > > It's not this patch set's fault, but I'm not very pleased to see that
> > > the injection point wait events have been shoehorned into the
> > > "Extension" category
> >
> > I've replied on that branch of the thread.
>
> I think the attached covers all comments to date. I gave everything v3, but
> most patches have just a no-conflict rebase vs. v2. The exceptions are
> inplace031-inj-wait-event (implements the holding from that branch of the
> thread) and inplace050-tests-inj (updated to cooperate with inplace031). Much
> of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the
> infrastructure common to the two custom wait event types.

Starting 2024-06-27, I'd like to push inplace080-catcache-detoast-inplace-stale
and earlier patches, self-certifying them if needed. Then I'll submit the last
three to the commitfest. Does anyone want me to delay that step?

Two more test-related changes compared to v3:

- In inplace010-tests, add to 027_stream_regress.pl a test that catalog
  contents match between primary and standby. If one of these patches broke
  replay of inplace updates, this would help catch it.

- In inplace031-inj-wait-event, make sysviews.sql indifferent to whether
  InjectionPoint wait events exist. installcheck needs this if other activity
  created such an event since the last postmaster restart.
Attachment
On Fri, Jun 21, 2024 at 02:28:42PM -0700, Noah Misch wrote: > On Thu, Jun 13, 2024 at 05:35:49PM -0700, Noah Misch wrote: > > I think the attached covers all comments to date. I gave everything v3, but > > most patches have just a no-conflict rebase vs. v2. The exceptions are > > inplace031-inj-wait-event (implements the holding from that branch of the > > thread) and inplace050-tests-inj (updated to cooperate with inplace031). Much > > of inplace031-inj-wait-event is essentially s/Extension/Custom/ for the > > infrastructure common to the two custom wait event types. > > Starting 2024-06-27, I'd like to push > inplace080-catcache-detoast-inplace-stale and earlier patches, self-certifying > them if needed. Then I'll submit the last three to the commitfest. Does > anyone want me to delay that step? Pushed. Buildfarm member prion is failing the new inplace-inval.spec, almost surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is testing an extant failure to inval a cache entry. Naturally, inexorable inval masks the extant bug. Ideally, I'd just skip the test under any kind of cache clobber option. I don't know a pleasant way to do that, so these are known-feasible things I'm considering: 1. Neutralize the test in all branches, probably by having it just not report the final answer. Undo in the later fix patch. 2. v14+ has pg_backend_memory_contexts. In the test, run some plpgsql that uses heuristics on that to deduce whether caches are getting released. Have a separate expected output for the cache-release scenario. Perhaps also have the test treat installcheck like cache-release, since installcheck could experience sinval reset with similar consequences. Neutralize the test in v12 & v13. 3. Add a test module with a C function that reports whether any kind of cache clobber is active. Call it in this test. Have a separate expected output for the cache-release scenario. Preferences or other ideas? I'm waffling between (1) and (2). I'll give it more thought over the next day. Thanks, nm
Noah Misch <noah@leadboat.com> writes: > Pushed. Buildfarm member prion is failing the new inplace-inval.spec, almost > surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is > testing an extant failure to inval a cache entry. Naturally, inexorable inval > masks the extant bug. Ideally, I'd just skip the test under any kind of cache > clobber option. I don't know a pleasant way to do that, so these are > known-feasible things I'm considering: > 1. Neutralize the test in all branches, probably by having it just not report > the final answer. Undo in the later fix patch. > 2. v14+ has pg_backend_memory_contexts. In the test, run some plpgsql that > uses heuristics on that to deduce whether caches are getting released. > Have a separate expected output for the cache-release scenario. Perhaps > also have the test treat installcheck like cache-release, since > installcheck could experience sinval reset with similar consequences. > Neutralize the test in v12 & v13. > 3. Add a test module with a C function that reports whether any kind of cache > clobber is active. Call it in this test. Have a separate expected output > for the cache-release scenario. > Preferences or other ideas? I'm waffling between (1) and (2). I'll give it > more thought over the next day. I'd just go for (1). We were doing fine without this test case. I can't see expending effort towards hiding its result rather than actually fixing anything. regards, tom lane
On Fri, Jun 28, 2024 at 01:17:22AM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > Pushed. Buildfarm member prion is failing the new inplace-inval.spec, almost > > surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is > > testing an extant failure to inval a cache entry. Naturally, inexorable inval > > masks the extant bug. Ideally, I'd just skip the test under any kind of cache > > clobber option. I don't know a pleasant way to do that, so these are > > known-feasible things I'm considering: > > > 1. Neutralize the test in all branches, probably by having it just not report > > the final answer. Undo in the later fix patch. > > > 2. v14+ has pg_backend_memory_contexts. In the test, run some plpgsql that > > uses heuristics on that to deduce whether caches are getting released. > > Have a separate expected output for the cache-release scenario. Perhaps > > also have the test treat installcheck like cache-release, since > > installcheck could experience sinval reset with similar consequences. > > Neutralize the test in v12 & v13. > > > 3. Add a test module with a C function that reports whether any kind of cache > > clobber is active. Call it in this test. Have a separate expected output > > for the cache-release scenario. > > > Preferences or other ideas? I'm waffling between (1) and (2). I'll give it > > more thought over the next day. > > I'd just go for (1). We were doing fine without this test case. > I can't see expending effort towards hiding its result rather > than actually fixing anything. Good point, any effort on (2) would be wasted once the fixes get certified. I pushed (1). I'm attaching the rebased fix patches.
Attachment
Hello Noah, 29.06.2024 05:42, Noah Misch wrote: > Good point, any effort on (2) would be wasted once the fixes get certified. I > pushed (1). I'm attaching the rebased fix patches. Please look at a new anomaly, introduced by inplace110-successors-v5.patch: CREATE TABLE t (i int) PARTITION BY LIST(i); CREATE TABLE p1 (i int); ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1); ALTER TABLE t DETACH PARTITION p1; ANALYZE t; triggers unexpected ERROR: tuple to be updated was already modified by an operation triggered by the current command Best regards, Alexander
On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote: > 29.06.2024 05:42, Noah Misch wrote: > > Good point, any effort on (2) would be wasted once the fixes get certified. I > > pushed (1). I'm attaching the rebased fix patches. > > Please look at a new anomaly, introduced by inplace110-successors-v5.patch: > CREATE TABLE t (i int) PARTITION BY LIST(i); > CREATE TABLE p1 (i int); > ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1); > ALTER TABLE t DETACH PARTITION p1; > ANALYZE t; > > triggers unexpected > ERROR: tuple to be updated was already modified by an operation triggered by the current command Thanks. Today, it's okay to issue heap_inplace_update() after heap_update() without an intervening CommandCounterIncrement(). The patch makes the CCI required. The ANALYZE in your example reaches this with a heap_update to set relhassubclass=f. I've fixed this by just adding a CCI (and adding to the tests in vacuum.sql). The alternative would be to allow inplace updates on TM_SelfModified tuples. I can't think of a specific problem with allowing that, but I feel that would make system state interactions harder to reason about. It might be optimal to allow that in back branches only, to reduce the chance of releasing a bug like the one you found.
Attachment
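To make the ordering concrete, a minimal sketch of the fixed code path follows;
variable names such as pg_class_rel, classtup, and classForm are illustrative,
and the real change sits in the vacuum/relhassubclass code:

	/* transactional update of the pg_class row, e.g. clearing relhassubclass */
	classForm->relhassubclass = false;
	CatalogTupleUpdate(pg_class_rel, &classtup->t_self, classtup);

	/*
	 * Make that update visible to the current command.  Without this, a
	 * later inplace update of the same row finds a TM_SelfModified tuple
	 * and fails, as in the ANALYZE example above.
	 */
	CommandCounterIncrement();

	/* ... the inplace update of relpages/reltuples etc. happens afterward ... */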
On Wed, Jul 03, 2024 at 04:09:54PM -0700, Noah Misch wrote: > On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote: > > 29.06.2024 05:42, Noah Misch wrote: > > > Good point, any effort on (2) would be wasted once the fixes get certified. I > > > pushed (1). I'm attaching the rebased fix patches. > > > > Please look at a new anomaly, introduced by inplace110-successors-v5.patch: > > CREATE TABLE t (i int) PARTITION BY LIST(i); > > CREATE TABLE p1 (i int); > > ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1); > > ALTER TABLE t DETACH PARTITION p1; > > ANALYZE t; > > > > triggers unexpected > > ERROR: tuple to be updated was already modified by an operation triggered by the current command > > Thanks. Today, it's okay to issue heap_inplace_update() after heap_update() > without an intervening CommandCounterIncrement(). Correction: it's not okay today. If code does that, heap_inplace_update() mutates a tuple that is going to become invisible at CCI. The lack of CCI yields a minor live bug in v14+. Its consequences seem to be limited to failing to update reltuples for a partitioned table having zero partitions. > The patch makes the CCI > required. The ANALYZE in your example reaches this with a heap_update to set > relhassubclass=f. I've fixed this by just adding a CCI (and adding to the > tests in vacuum.sql). That's still the right fix, but I've separated it into its own patch and expanded the test. All the non-comment changes between v5 and v6 are now part of the separate patch. > The alternative would be to allow inplace updates on TM_SelfModified tuples. > I can't think of a specific problem with allowing that, but I feel that would > make system state interactions harder to reason about. It might be optimal to > allow that in back branches only, to reduce the chance of releasing a bug like > the one you found. Allowing a mutation of a TM_SelfModified tuple is bad, since that tuple is going to become dead soon. Mutating its successor could be okay. Since we'd expect such code to be unreachable, I'm not keen carry such code. For that scenario, I'd rather keep the error you encountered. Other opinions?
Attachment
Hello Noah, 28.06.2024 08:13, Noah Misch wrote: > Pushed. ... Please look also at another anomaly, I've discovered. An Assert added with d5f788b41 may be falsified with: CREATE TABLE t(a int PRIMARY KEY); INSERT INTO t VALUES (1); CREATE VIEW v AS SELECT * FROM t; MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a WHEN MATCHED THEN DO NOTHING WHEN NOT MATCHED THEN DO NOTHING; TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", Line: 2891, PID: 1590670 Best regards, Alexander
On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote: > 28.06.2024 08:13, Noah Misch wrote: > > Pushed. ... > > Please look also at another anomaly, I've discovered. > > An Assert added with d5f788b41 may be falsified with: > CREATE TABLE t(a int PRIMARY KEY); > INSERT INTO t VALUES (1); > CREATE VIEW v AS SELECT * FROM t; > > MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a > WHEN MATCHED THEN DO NOTHING > WHEN NOT MATCHED THEN DO NOTHING; > > TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", Line: 2891, PID: 1590670 Thanks. When all the MERGE actions are DO NOTHING, view_has_instead_trigger() returns true, so we use the wholerow code regardless of the view's triggers or auto update capability. The behavior is fine, so I'm fixing the new assertion and comments with new patch inplace087-merge-DO-NOTHING-v8.patch. The closest relevant tests processed zero rows, so they narrowly avoided witnessing this assertion.
Attachment
On Thu, Jul 04, 2024 at 03:08:16PM -0700, Noah Misch wrote: > On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote: > > 28.06.2024 08:13, Noah Misch wrote: > > > Pushed. ... > > > > Please look also at another anomaly, I've discovered. > > > > An Assert added with d5f788b41 may be falsified with: > > CREATE TABLE t(a int PRIMARY KEY); > > INSERT INTO t VALUES (1); > > CREATE VIEW v AS SELECT * FROM t; > > > > MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a > > WHEN MATCHED THEN DO NOTHING > > WHEN NOT MATCHED THEN DO NOTHING; > > > > TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", Line: 2891, PID: 1590670 > > Thanks. When all the MERGE actions are DO NOTHING, view_has_instead_trigger() > returns true I've pushed the two patches for your reports. To placate cfbot, I'm attaching the remaining patches.
Attachment
Hello Noah,

28.06.2024 08:13, Noah Misch wrote:
> Pushed.

A recent buildfarm test failure [1] showed that the intra-grant-inplace-db.spec
test added with 0844b3968 may fail on a slow machine (per my understanding):

test intra-grant-inplace-db       ... FAILED     4302 ms

@@ -21,8 +21,7 @@
 WHERE datname = current_catalog
   AND age(datfrozenxid) > (SELECT min(age(x)) FROM frozen_witness);

-?column?
-----------------------
-datfrozenxid retreated
-(1 row)
+?column?
+--------
+(0 rows)

whilst the previous (successful) run shows a much shorter duration:
test intra-grant-inplace-db       ... ok          540 ms

I reproduced this failure on a VM slowed down so that the test duration
reached 4+ seconds, with 100 copies of "test: intra-grant-inplace-db" in
isolation_schedule:
test intra-grant-inplace-db       ... ok         4324 ms
test intra-grant-inplace-db       ... FAILED     4633 ms
test intra-grant-inplace-db       ... ok         4649 ms

But as the test is going to be modified by inplace110-successors-v8.patch, and
the modified test (with all three latest patches applied) passes reliably
under the same conditions, maybe this failure doesn't deserve deeper
exploration. What do you think?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08

Best regards,
Alexander
On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote: > 28.06.2024 08:13, Noah Misch wrote: > > Pushed. > > A recent buildfarm test failure [1] showed that the > intra-grant-inplace-db.spec test added with 0844b3968 may fail > on a slow machine > But as the test going to be modified by the inplace110-successors-v8.patch > and the modified test (with all three latest patches applied) passes > reliably in the same conditions, maybe this failure doesn't deserve a > deeper exploration. Agreed. Let's just wait for code review of the actual bug fix, not develop a separate change to stabilize the test. One flake in three weeks is low enough to make that okay. > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08
Noah Misch <noah@leadboat.com> writes: > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote: >> A recent buildfarm test failure [1] showed that the >> intra-grant-inplace-db.spec test added with 0844b3968 may fail >> on a slow machine >> But as the test going to be modified by the inplace110-successors-v8.patch >> and the modified test (with all three latest patches applied) passes >> reliably in the same conditions, maybe this failure doesn't deserve a >> deeper exploration. > Agreed. Let's just wait for code review of the actual bug fix, not develop a > separate change to stabilize the test. One flake in three weeks is low enough > to make that okay. It's now up to three similar failures in the past ten days: in addition to https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08 I see https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2024-07-22%2018%3A00%3A46 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-07-28%2012%3A20%3A37 Is it time to worry yet? If this were HEAD only, I'd not be too concerned; but two of these three are on allegedly-stable branches. And we have releases coming up fast. (BTW, I don't think taipan qualifies as a slow machine.) regards, tom lane
On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote: > >> A recent buildfarm test failure [1] showed that the > >> intra-grant-inplace-db.spec test added with 0844b3968 may fail > > >> But as the test going to be modified by the inplace110-successors-v8.patch > >> and the modified test (with all three latest patches applied) passes > >> reliably in the same conditions, maybe this failure doesn't deserve a > >> deeper exploration. > > > Agreed. Let's just wait for code review of the actual bug fix, not develop a > > separate change to stabilize the test. One flake in three weeks is low enough > > to make that okay. > > It's now up to three similar failures in the past ten days > Is it time to worry yet? If this were HEAD only, I'd not be too > concerned; but two of these three are on allegedly-stable branches. > And we have releases coming up fast. I don't know; neither decision feels terrible to me. A bug fix that would address both the data corruption causes and those buildfarm failures has been awaiting review on this thread for 77 days. The data corruption causes are more problematic than 0.03% of buildfarm runs getting noise failures. Two wrongs don't make a right, but a commit masking that level of buildfarm noise also feels like sending the wrong message.
Noah Misch <noah@leadboat.com> writes: > On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote: >> Is it time to worry yet? If this were HEAD only, I'd not be too >> concerned; but two of these three are on allegedly-stable branches. >> And we have releases coming up fast. > I don't know; neither decision feels terrible to me. Yeah, same here. Obviously, it'd be better to spend effort on getting the bug fix committed than to spend effort on some cosmetic workaround. The fact that the failure is in the isolation tests not the core regression tests reduces my level of concern somewhat about shipping it this way. I think that packagers typically run the core tests not check-world during package verification, so they won't hit this. regards, tom lane
On 14/07/2024 20:48, Noah Misch wrote:
> I've pushed the two patches for your reports. To placate cfbot, I'm attaching
> the remaining patches.

inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would be in
order; it looks pretty random as it is. Something like:

/*
 * Tuple locks are currently only held for short durations within a
 * transaction. Check that we didn't forget to release one.
 */

inplace110-successors-v8.patch: Makes sense.

The README changes would be better as part of the third patch, as this patch
doesn't actually do any of the new locking described in the README, and it
fixes the "inplace update updates wrong tuple" bug even without those tuple
locks.

> + * ... [any slow preparation not requiring oldtup] ...
> + * heap_inplace_update_scan([...], &tup, &inplace_state);
> + * if (!HeapTupleIsValid(tup))
> + *	elog(ERROR, [...]);
> + * ... [buffer is exclusive-locked; mutate "tup"] ...
> + * if (dirty)
> + *	heap_inplace_update_finish(inplace_state, tup);
> + * else
> + *	heap_inplace_update_cancel(inplace_state);

I wonder if the functions should be called "systable_*" and placed in genam.c
rather than in heapam.c. The interface looks more like the existing systable
functions. It feels like a modularity violation for a function in heapam.c to
take an argument like "indexId", and call back into systable_* functions.

> /*----------
>  * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
>  *
>  * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
>  * ["R" is a VACUUM tbl]
>  * D: vac_update_datfrozenid() -> systable_beginscan(pg_class)
>  * D: systable_getnext() returns pg_class tuple of tbl
>  * R: memcpy() into pg_class tuple of tbl
>  * D: raise pg_database.datfrozenxid, XLogInsert(), finish
>  * [crash]
>  * [recovery restores datfrozenxid w/o relfrozenxid]
>  */

Hmm, that's a tight race, but it feels bad to leave it unfixed. One approach
would be to modify the tuple on the buffer only after WAL-logging it. That
way, D cannot read the updated value before it has been WAL-logged. Just need
to make sure that the change still gets included in the WAL record. Maybe
something like:

if (RelationNeedsWAL(relation))
{
	/*
	 * Make a temporary copy of the page that includes the change, in
	 * case a full-page image is logged
	 */
	PGAlignedBlock tmppage;

	memcpy(tmppage.data, page, BLCKSZ);

	/* copy the tuple to the temporary copy */
	memcpy(...);

	XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
	XLogInsert();
}

/* copy the tuple to the buffer */
memcpy(...);

> pg_class heap_inplace_update_scan() callers: before the call, acquire
> LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
> (VACUUM), or a mode with strictly more conflicts. If the update targets a
> row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
> on the table. Locking the index rel is optional. (This allows VACUUM to
> overwrite per-index pg_class while holding a lock on the table alone.) We
> could allow weaker locks, in which case the next paragraph would simply call
> for stronger locks for its class of commands. heap_inplace_update_scan()
> acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for
> ExclusiveLock, on each tuple it overwrites.
>
> pg_class heap_update() callers: before copying the tuple to modify, take a
> lock that conflicts with at least one of those from the preceding paragraph.
> SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE.
> After heap_update(), release any LOCKTAG_TUPLE. Most of these callers opt
> to acquire just the LOCKTAG_RELATION.

These rules seem complicated. Phrasing this slightly differently, if I
understand correctly: for a heap_update() caller, it's always sufficient to
hold LOCKTAG_TUPLE, but if you happen to hold some other lock on the relation
that conflicts with those mentioned in the first paragraph, then you can skip
the LOCKTAG_TUPLE lock.

Could we just stipulate that you must always hold LOCKTAG_TUPLE when you call
heap_update() on pg_class or pg_database? That'd make the rule simple.

--
Heikki Linnakangas
Neon (https://neon.tech)
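As an illustration of the stricter rule floated above (always hold the tuple
lock for a transactional pg_class update), a caller might look roughly like
this. InplaceUpdateTupleLock is the alias the patch introduces for
ExclusiveLock; the patch's SearchSysCacheLocked1() additionally retries if the
row is updated between the syscache lookup and the lock, which this sketch
omits:

	Relation	pg_class_rel = table_open(RelationRelationId, RowExclusiveLock);
	HeapTuple	tup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));

	if (!HeapTupleIsValid(tup))
		elog(ERROR, "cache lookup failed for relation %u", relid);

	/* serialize against inplace updaters of this row */
	LockTuple(pg_class_rel, &tup->t_self, InplaceUpdateTupleLock);

	/* ... modify the copied tuple ... */
	CatalogTupleUpdate(pg_class_rel, &tup->t_self, tup);

	UnlockTuple(pg_class_rel, &tup->t_self, InplaceUpdateTupleLock);
	table_close(pg_class_rel, RowExclusiveLock);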
Thanks for reviewing.

On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
> On 14/07/2024 20:48, Noah Misch wrote:
> > I've pushed the two patches for your reports. To placate cfbot, I'm attaching the remaining patches.
>
> inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would be in order; it looks pretty random as it is. Something like:
>
> /*
>  * Tuple locks are currently only held for short durations within a
>  * transaction. Check that we didn't forget to release one.
>  */

Will add.

> inplace110-successors-v8.patch: Makes sense.
>
> The README changes would be better as part of the third patch, as this patch doesn't actually do any of the new locking described in the README, and it fixes the "inplace update updates wrong tuple" bug even without those tuple locks.

That should work.  Will confirm.

> > + * ... [any slow preparation not requiring oldtup] ...
> > + * heap_inplace_update_scan([...], &tup, &inplace_state);
> > + * if (!HeapTupleIsValid(tup))
> > + *     elog(ERROR, [...]);
> > + * ... [buffer is exclusive-locked; mutate "tup"] ...
> > + * if (dirty)
> > + *     heap_inplace_update_finish(inplace_state, tup);
> > + * else
> > + *     heap_inplace_update_cancel(inplace_state);
>
> I wonder if the functions should be called "systable_*" and placed in genam.c rather than in heapam.c. The interface looks more like the existing systable functions. It feels like a modularity violation for a function in heapam.c to take an argument like "indexId", and call back into systable_* functions.

Yes, _scan() and _cancel() especially are wrappers around systable.  Some API options follow.  Any preference or other ideas?

==== direct s/heap_/systable_/ rename

    systable_inplace_update_scan([...], &tup, &inplace_state);
    if (!HeapTupleIsValid(tup))
        elog(ERROR, [...]);
    ... [buffer is exclusive-locked; mutate "tup"] ...
    if (dirty)
        systable_inplace_update_finish(inplace_state, tup);
    else
        systable_inplace_update_cancel(inplace_state);

==== make the first and last steps more systable-like

    systable_inplace_update_begin([...], &tup, &inplace_state);
    if (!HeapTupleIsValid(tup))
        elog(ERROR, [...]);
    ... [buffer is exclusive-locked; mutate "tup"] ...
    if (dirty)
        systable_inplace_update(inplace_state, tup);
    systable_inplace_update_end(inplace_state);

==== no systable_ wrapper for middle step, more like CatalogTupleUpdate

    systable_inplace_update_begin([...], &tup, &inplace_state);
    if (!HeapTupleIsValid(tup))
        elog(ERROR, [...]);
    ... [buffer is exclusive-locked; mutate "tup"] ...
    if (dirty)
        heap_inplace_update(relation,
                            systable_inplace_old_tuple(inplace_state),
                            tup,
                            systable_inplace_buffer(inplace_state));
    systable_inplace_update_end(inplace_state);

> > /*----------
> >  * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
> >  *
> >  * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
> >  * ["R" is a VACUUM tbl]
> >  * D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
> >  * D: systable_getnext() returns pg_class tuple of tbl
> >  * R: memcpy() into pg_class tuple of tbl
> >  * D: raise pg_database.datfrozenxid, XLogInsert(), finish
> >  * [crash]
> >  * [recovery restores datfrozenxid w/o relfrozenxid]
> >  */
>
> Hmm, that's a tight race, but it feels bad to leave it unfixed. One approach would be to modify the tuple on the buffer only after WAL-logging it. That way, D cannot read the updated value before it has been WAL-logged. We just need to make sure that the change still gets included in the WAL record.
> Maybe something like:
>
> if (RelationNeedsWAL(relation))
> {
>     /*
>      * Make a temporary copy of the page that includes the change, in
>      * case a full-page image is logged.
>      */
>     PGAlignedBlock tmppage;
>
>     memcpy(tmppage.data, page, BLCKSZ);
>
>     /* copy the tuple to the temporary copy */
>     memcpy(...);
>
>     XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
>     XLogInsert();
> }
>
> /* copy the tuple to the buffer */
> memcpy(...);

Yes, that's the essence of inplace180-datfrozenxid-overtakes-relfrozenxid-v1.patch from https://postgr.es/m/flat/20240620012908.92.nmisch@google.com.

> > pg_class heap_inplace_update_scan() callers: before the call, acquire LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock (VACUUM), or a mode with strictly more conflicts.  If the update targets a row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be on the table.  Locking the index rel is optional.  (This allows VACUUM to overwrite per-index pg_class while holding a lock on the table alone.)  We could allow weaker locks, in which case the next paragraph would simply call for stronger locks for its class of commands.  heap_inplace_update_scan() acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for ExclusiveLock, on each tuple it overwrites.
> >
> > pg_class heap_update() callers: before copying the tuple to modify, take a lock that conflicts with at least one of those from the preceding paragraph.  SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE.  After heap_update(), release any LOCKTAG_TUPLE.  Most of these callers opt to acquire just the LOCKTAG_RELATION.
>
> These rules seem complicated. Phrasing this slightly differently, if I understand correctly: for a heap_update() caller, it's always sufficient to hold LOCKTAG_TUPLE, but if you happen to hold some other lock on the relation that conflicts with those mentioned in the first paragraph, then you can skip the LOCKTAG_TUPLE lock.

Yes.

> Could we just stipulate that you must always hold LOCKTAG_TUPLE when you call heap_update() on pg_class or pg_database? That'd make the rule simple.

We could.  That would change more code sites.  Rough estimate:

$ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l
23

If the count were 2, I'd say let's simplify the rule like you're exploring.  (I originally had a complicated rule for pg_database, but I abandoned that when it helped few code sites.)  If it were 100, I'd say the complicated rule is worth it.  A count of 23 makes both choices fair.

Long-term, I hope relfrozenxid gets reimplemented with storage outside pg_class, removing the need for inplace updates.  So the additional 23 code sites might change back at a future date.  That shouldn't be a big consideration, though.

Another option here would be to preface that README section with a simplified view, something like, "If a warning brought you here, take a tuple lock.  The rest of this section is just for people needing to understand the conditions for --enable-cassert emitting that warning."  How about that instead of simplifying the rules?
On 17/08/2024 07:07, Noah Misch wrote:
> On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
>> On 14/07/2024 20:48, Noah Misch wrote:
>>> + * ... [any slow preparation not requiring oldtup] ...
>>> + * heap_inplace_update_scan([...], &tup, &inplace_state);
>>> + * if (!HeapTupleIsValid(tup))
>>> + *     elog(ERROR, [...]);
>>> + * ... [buffer is exclusive-locked; mutate "tup"] ...
>>> + * if (dirty)
>>> + *     heap_inplace_update_finish(inplace_state, tup);
>>> + * else
>>> + *     heap_inplace_update_cancel(inplace_state);
>>
>> I wonder if the functions should be called "systable_*" and placed in genam.c rather than in heapam.c. The interface looks more like the existing systable functions. It feels like a modularity violation for a function in heapam.c to take an argument like "indexId", and call back into systable_* functions.
>
> Yes, _scan() and _cancel() especially are wrappers around systable.  Some API options follow.  Any preference or other ideas?
>
> ==== direct s/heap_/systable_/ rename [option 1]
>
>     systable_inplace_update_scan([...], &tup, &inplace_state);
>     if (!HeapTupleIsValid(tup))
>         elog(ERROR, [...]);
>     ... [buffer is exclusive-locked; mutate "tup"] ...
>     if (dirty)
>         systable_inplace_update_finish(inplace_state, tup);
>     else
>         systable_inplace_update_cancel(inplace_state);
>
> ==== make the first and last steps more systable-like [option 2]
>
>     systable_inplace_update_begin([...], &tup, &inplace_state);
>     if (!HeapTupleIsValid(tup))
>         elog(ERROR, [...]);
>     ... [buffer is exclusive-locked; mutate "tup"] ...
>     if (dirty)
>         systable_inplace_update(inplace_state, tup);
>     systable_inplace_update_end(inplace_state);
>
> ==== no systable_ wrapper for middle step, more like CatalogTupleUpdate [option 3]
>
>     systable_inplace_update_begin([...], &tup, &inplace_state);
>     if (!HeapTupleIsValid(tup))
>         elog(ERROR, [...]);
>     ... [buffer is exclusive-locked; mutate "tup"] ...
>     if (dirty)
>         heap_inplace_update(relation,
>                             systable_inplace_old_tuple(inplace_state),
>                             tup,
>                             systable_inplace_buffer(inplace_state));
>     systable_inplace_update_end(inplace_state);

My order of preference is: 2, 1, 3.

>> Could we just stipulate that you must always hold LOCKTAG_TUPLE when you call heap_update() on pg_class or pg_database? That'd make the rule simple.
>
> We could.  That would change more code sites.  Rough estimate:
>
> $ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l
> 23
>
> If the count were 2, I'd say let's simplify the rule like you're exploring.  (I originally had a complicated rule for pg_database, but I abandoned that when it helped few code sites.)  If it were 100, I'd say the complicated rule is worth it.  A count of 23 makes both choices fair.

Ok. How many of those are for RELKIND_INDEX vs. tables? I'm wondering if we should always require a tuple lock on indexes, and whether that would make a difference.

> Long-term, I hope relfrozenxid gets reimplemented with storage outside pg_class, removing the need for inplace updates.  So the additional 23 code sites might change back at a future date.  That shouldn't be a big consideration, though.
>
> Another option here would be to preface that README section with a simplified view, something like, "If a warning brought you here, take a tuple lock.  The rest of this section is just for people needing to understand the conditions for --enable-cassert emitting that warning."  How about that instead of simplifying the rules?

Works for me.

Or perhaps the rules could just be explained more succinctly. Something like:

-----
pg_class heap_inplace_update_scan() callers: before the call, acquire a lock on the relation in mode ShareUpdateExclusiveLock or stricter. If the update targets a row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be on the table; locking the index rel is not necessary. (This allows VACUUM to overwrite per-index pg_class while holding a lock on the table alone.) heap_inplace_update_scan() acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for ExclusiveLock, on each tuple it overwrites.

pg_class heap_update() callers: before copying the tuple to modify, take a lock on the tuple, or a ShareUpdateExclusiveLock or stricter on the relation. SearchSysCacheLocked1() is one convenient way to acquire the tuple lock. Most heap_update() callers already hold a suitable lock on the relation for other reasons, and can skip the tuple lock. If you do acquire the tuple lock, release it immediately after the update.

pg_database: before copying the tuple to modify, all updaters of pg_database rows acquire LOCKTAG_TUPLE. (Few updaters acquire LOCKTAG_OBJECT on the database OID, so it wasn't worth extending that as a second option.)
-----

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah@leadboat.com> wrote:
>
> On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > My order of preference is: 2, 1, 3.
>
> I kept tuple locking responsibility in heapam.c.  That's simpler and better for modularity, but it does mean we release+acquire after any xmax wait.  Before, we avoided that if the next genam.c scan found the same TID.  (If the next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts are rare enough to justify simplifying as this version does.  I don't expect anyone to notice the starvation outside of tests built to show it.  (With previous versions, one can show it with a purpose-built test that commits instead of aborting, like the "001_pgbench_grant@9" test.)
>
> This move also loses the optimization of unpinning before XactLockTableWait().  heap_update() doesn't optimize that way, so that's fine.
>
> The move ended up more like (1), though I did do s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I felt that worked better than (2) to achieve lock release before CacheInvalidateHeapTuple().  Alternatives that could be fine:

From a consistency point of view, I find it cleaner if we can have all the heap_inplace_lock and heap_inplace_unlock in the same set of functions. So here those would be the systable_inplace_* functions.

> - In the cancel case, call both systable_inplace_update_cancel and
>   systable_inplace_update_end.  _finish or _cancel would own unlock, while
>   _end would own systable_endscan().

What happens to CacheInvalidateHeapTuple() in this approach? I think it will still need to be brought to the genam.c layer if we are releasing the lock in systable_inplace_update_finish.

> - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
>   tolerable now, this gets less attractive after the inplace160 patch from
>   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com

I skimmed through the inplace160 patch. It wasn't clear to me why this becomes less attractive with the patch. I see there is a new CacheInvalidateHeapTupleInPlace but that looks like it would be called while holding the lock. And then there is an AcceptInvalidationMessages which can perhaps be moved to the genam.c layer too. Is the concern that one invalidation call will be in the heapam layer and the other will be in the genam layer?

Also I have a small question from inplace120.

I see that all the places we check resultRelInfo->ri_needLockTagTuple, we can just call IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a big advantage of storing a separate bool field? Also there is another write to ri_RelationDesc in CatalogOpenIndexes in src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to be set there also to keep it consistent with ri_RelationDesc. Please let me know if I am missing something about the usage of the new field.

Thanks & Regards,
Nitin Motiani
Google
On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > My order of preference is: 2, 1, 3.
> >
> > I kept tuple locking responsibility in heapam.c.  That's simpler and better for modularity, but it does mean we release+acquire after any xmax wait.  Before, we avoided that if the next genam.c scan found the same TID.  (If the next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts are rare enough to justify simplifying as this version does.  I don't expect anyone to notice the starvation outside of tests built to show it.  (With previous versions, one can show it with a purpose-built test that commits instead of aborting, like the "001_pgbench_grant@9" test.)
> >
> > This move also loses the optimization of unpinning before XactLockTableWait().  heap_update() doesn't optimize that way, so that's fine.
> >
> > The move ended up more like (1), though I did do s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I felt that worked better than (2) to achieve lock release before CacheInvalidateHeapTuple().  Alternatives that could be fine:
>
> From a consistency point of view, I find it cleaner if we can have all the heap_inplace_lock and heap_inplace_unlock in the same set of functions. So here those would be the systable_inplace_* functions.

That will technically be the case after inplace160, and I could make it so here by inlining heap_inplace_unlock() into its heapam.c caller.  Would that be cleaner or less clean?

> > - In the cancel case, call both systable_inplace_update_cancel and
> >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> >   _end would own systable_endscan().
>
> What happens to CacheInvalidateHeapTuple() in this approach? I think it will still need to be brought to the genam.c layer if we are releasing the lock in systable_inplace_update_finish.

Cancel scenarios don't do invalidation.  (Same under other alternatives.)

> > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> >   tolerable now, this gets less attractive after the inplace160 patch from
> >   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com
>
> I skimmed through the inplace160 patch. It wasn't clear to me why this becomes less attractive with the patch. I see there is a new CacheInvalidateHeapTupleInPlace but that looks like it would be called while holding the lock. And then there is an AcceptInvalidationMessages which can perhaps be moved to the genam.c layer too. Is the concern that one invalidation call will be in the heapam layer and the other will be in the genam layer?

That, or a critical section would start in heapam.c, then end in genam.c.  Current call tree at inplace160 v4:

genam.c:systable_inplace_update_finish
    heapam.c:heap_inplace_update
        PreInplace_Inval
        START_CRIT_SECTION
        BUFFER_LOCK_UNLOCK
        AtInplace_Inval
        END_CRIT_SECTION
        UnlockTuple
        AcceptInvalidationMessages

If we hoisted all of invalidation up to the genam.c layer, a critical section that starts in heapam.c would end in genam.c:

genam.c:systable_inplace_update_finish
    PreInplace_Inval
    heapam.c:heap_inplace_update
        START_CRIT_SECTION
        BUFFER_LOCK_UNLOCK
    AtInplace_Inval
    END_CRIT_SECTION
    UnlockTuple
    AcceptInvalidationMessages

If we didn't accept splitting the critical section but did accept splitting invalidation responsibilities, one gets perhaps:

genam.c:systable_inplace_update_finish
    PreInplace_Inval
    heapam.c:heap_inplace_update
        START_CRIT_SECTION
        BUFFER_LOCK_UNLOCK
        AtInplace_Inval
        END_CRIT_SECTION
        UnlockTuple
    AcceptInvalidationMessages

That's how I ended up at inplace120 v9's design.

> Also I have a small question from inplace120.
>
> I see that all the places we check resultRelInfo->ri_needLockTagTuple, we can just call IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a big advantage of storing a separate bool field? Also there is another

No, not a big advantage.  I felt it was more in line with the typical style of src/backend/executor.

> write to ri_RelationDesc in CatalogOpenIndexes in src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to be set there also to keep it consistent with ri_RelationDesc. Please let me know if I am missing something about the usage of the new field.

Can you say more about consequences you found?

Only the full executor reads the field, doing so when it fetches the most recent version of a row.  CatalogOpenIndexes() callers lack the full executor's practice of fetching the most recent version of a row, so they couldn't benefit from reading the field.

I don't think any CatalogOpenIndexes() caller passes its ResultRelInfo to the full executor, and "typedef struct ResultRelInfo *CatalogIndexState" exists in part to keep it that way.  Since CatalogOpenIndexes() skips ri_TrigDesc and other fields, I would expect other malfunctions if some caller tried.

Thanks,
nm
On Sat, Aug 31, 2024 at 6:40 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> > On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah@leadboat.com> wrote:
> > > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > > My order of preference is: 2, 1, 3.
> > >
> > > I kept tuple locking responsibility in heapam.c.  That's simpler and better for modularity, but it does mean we release+acquire after any xmax wait.  Before, we avoided that if the next genam.c scan found the same TID.  (If the next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts are rare enough to justify simplifying as this version does.  I don't expect anyone to notice the starvation outside of tests built to show it.  (With previous versions, one can show it with a purpose-built test that commits instead of aborting, like the "001_pgbench_grant@9" test.)
> > >
> > > This move also loses the optimization of unpinning before XactLockTableWait().  heap_update() doesn't optimize that way, so that's fine.
> > >
> > > The move ended up more like (1), though I did do s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I felt that worked better than (2) to achieve lock release before CacheInvalidateHeapTuple().  Alternatives that could be fine:
> >
> > From a consistency point of view, I find it cleaner if we can have all the heap_inplace_lock and heap_inplace_unlock in the same set of functions. So here those would be the systable_inplace_* functions.
>
> That will technically be the case after inplace160, and I could make it so here by inlining heap_inplace_unlock() into its heapam.c caller.  Would that be cleaner or less clean?

I am not sure. It seems more inconsistent to take the lock using heap_inplace_lock but then just unlock by calling LockBuffer. On the other hand, it doesn't seem that different from the way SearchSysCacheLocked1 and UnlockTuple are used in inplace120. If we are doing it this way, perhaps it would be good to rename heap_inplace_update to heap_inplace_update_and_unlock.

> > > - In the cancel case, call both systable_inplace_update_cancel and
> > >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> > >   _end would own systable_endscan().
> >
> > What happens to CacheInvalidateHeapTuple() in this approach? I think it will still need to be brought to the genam.c layer if we are releasing the lock in systable_inplace_update_finish.
>
> Cancel scenarios don't do invalidation.  (Same under other alternatives.)

Sorry, I wasn't clear about this one. Let me rephrase. My understanding is that the code in this approach would look like below:

    if (dirty)
        systable_inplace_update_finish(inplace_state, tup);
    else
        systable_inplace_update_cancel(inplace_state);
    systable_inplace_update_end(inplace_state);

And that in this structure, both _finish and _cancel will call heap_inplace_unlock, and then _end will call systable_endscan. So even with this structure, the invalidation has to happen inside _finish after the unlock. So this also pulls the invalidation to the genam.c layer. Am I understanding this correctly?

> > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > >   tolerable now, this gets less attractive after the inplace160 patch from
> > >   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com
> >
> > I skimmed through the inplace160 patch. It wasn't clear to me why this becomes less attractive with the patch. I see there is a new CacheInvalidateHeapTupleInPlace but that looks like it would be called while holding the lock. And then there is an AcceptInvalidationMessages which can perhaps be moved to the genam.c layer too. Is the concern that one invalidation call will be in the heapam layer and the other will be in the genam layer?
>
> That, or a critical section would start in heapam.c, then end in genam.c.  Current call tree at inplace160 v4:
>
> genam.c:systable_inplace_update_finish
>     heapam.c:heap_inplace_update
>         PreInplace_Inval
>         START_CRIT_SECTION
>         BUFFER_LOCK_UNLOCK
>         AtInplace_Inval
>         END_CRIT_SECTION
>         UnlockTuple
>         AcceptInvalidationMessages
>
> If we hoisted all of invalidation up to the genam.c layer, a critical section that starts in heapam.c would end in genam.c:
>
> genam.c:systable_inplace_update_finish
>     PreInplace_Inval
>     heapam.c:heap_inplace_update
>         START_CRIT_SECTION
>         BUFFER_LOCK_UNLOCK
>     AtInplace_Inval
>     END_CRIT_SECTION
>     UnlockTuple
>     AcceptInvalidationMessages
>
> If we didn't accept splitting the critical section but did accept splitting invalidation responsibilities, one gets perhaps:
>
> genam.c:systable_inplace_update_finish
>     PreInplace_Inval
>     heapam.c:heap_inplace_update
>         START_CRIT_SECTION
>         BUFFER_LOCK_UNLOCK
>         AtInplace_Inval
>         END_CRIT_SECTION
>         UnlockTuple
>     AcceptInvalidationMessages

How about this alternative?

genam.c:systable_inplace_update_finish
    PreInplace_Inval
    START_CRIT_SECTION
    heapam.c:heap_inplace_update
        BUFFER_LOCK_UNLOCK
    AtInplace_Inval
    END_CRIT_SECTION
    UnlockTuple
    AcceptInvalidationMessages

Looking at inplace160, it seems that the start of the critical section is right after PreInplace_Inval. So why not pull START_CRIT_SECTION and END_CRIT_SECTION out to the genam.c layer? Alternatively, since heap_inplace_update is commented as a subroutine of systable_inplace_update_finish, should everything just be moved to the genam.c layer? Although it looks like you already considered and rejected this approach. So just pulling out the critical section's start and end is fine. Am I missing something here?

If the above alternatives are not possible, it's probably fine to go ahead with the current patch with the function renamed to heap_inplace_update_and_unlock (or something similar) as mentioned earlier?

> That's how I ended up at inplace120 v9's design.
>
> > Also I have a small question from inplace120.
> >
> > I see that all the places we check resultRelInfo->ri_needLockTagTuple, we can just call IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a big advantage of storing a separate bool field? Also there is another
>
> No, not a big advantage.  I felt it was more in line with the typical style of src/backend/executor.

Thanks for the clarification. For ri_TrigDesc, I see the following comment in execMain.c:

    /* make a copy so as not to depend on relcache info not changing... */
    resultRelInfo->ri_TrigDesc = CopyTriggerDesc(resultRelationDesc->trigdesc);

So in this case I see more value in having a separate field, compared to the bool field for ri_needLockTagTuple.

> > write to ri_RelationDesc in CatalogOpenIndexes in src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to be set there also to keep it consistent with ri_RelationDesc. Please let me know if I am missing something about the usage of the new field.
>
> Can you say more about consequences you found?

My apologies that I wasn't clear. I haven't found any consequences. I just find it a smell that there are two fields which are not independent and they go out of sync. And that's why my preference is to not have a dependent field unless there is a specific advantage.

> Only the full executor reads the field, doing so when it fetches the most recent version of a row.  CatalogOpenIndexes() callers lack the full executor's practice of fetching the most recent version of a row, so they couldn't benefit from reading the field.
>
> I don't think any CatalogOpenIndexes() caller passes its ResultRelInfo to the full executor, and "typedef struct ResultRelInfo *CatalogIndexState" exists in part to keep it that way.  Since CatalogOpenIndexes() skips ri_TrigDesc and other fields, I would expect other malfunctions if some caller tried.

Sorry, I missed the typedef. Thanks for pointing that out. I agree that the likelihood of any malfunction is low. But even for the ri_TrigDesc, CatalogOpenIndexes() still sets it to NULL. So shouldn't ri_needLockTagTuple also be set to a default value of false? My preference would be not to have a separate bool field, to avoid thinking about these scenarios.

Thanks & Regards
Nitin Motiani
Google
On Tue, Sep 03, 2024 at 09:24:52PM +0530, Nitin Motiani wrote:
> On Sat, Aug 31, 2024 at 6:40 AM Noah Misch <noah@leadboat.com> wrote:
> > On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> > > On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah@leadboat.com> wrote:
> > > > - In the cancel case, call both systable_inplace_update_cancel and
> > > >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> > > >   _end would own systable_endscan().
> > >
> > > What happens to CacheInvalidateHeapTuple() in this approach? I think it will still need to be brought to the genam.c layer if we are releasing the lock in systable_inplace_update_finish.
>
> My understanding is that the code in this approach would look like below:
>
>     if (dirty)
>         systable_inplace_update_finish(inplace_state, tup);
>     else
>         systable_inplace_update_cancel(inplace_state);
>     systable_inplace_update_end(inplace_state);
>
> And that in this structure, both _finish and _cancel will call heap_inplace_unlock, and then _end will call systable_endscan. So even with this structure, the invalidation has to happen inside _finish after the unlock.

Right.

> So this also pulls the invalidation to the genam.c layer. Am I understanding this correctly?

Compared to the v9 patch, the "call both" alternative would just move the systable_endscan() call to a new systable_inplace_update_end().  It wouldn't move anything across the genam:heapam boundary.  systable_inplace_update_finish() would remain a thin wrapper around a heapam function.

> > > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > > >   tolerable now, this gets less attractive after the inplace160 patch from
> > > >   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com
> > >
> > > I skimmed through the inplace160 patch. It wasn't clear to me why this becomes less attractive with the patch. I see there is a new CacheInvalidateHeapTupleInPlace but that looks like it would be called while holding the lock. And then there is an AcceptInvalidationMessages which can perhaps be moved to the genam.c layer too. Is the concern that one invalidation call will be in the heapam layer and the other will be in the genam layer?
> >
> > That, or a critical section would start in heapam.c, then end in genam.c.  Current call tree at inplace160 v4:
>
> How about this alternative?
>
> genam.c:systable_inplace_update_finish
>     PreInplace_Inval
>     START_CRIT_SECTION
>     heapam.c:heap_inplace_update
>         BUFFER_LOCK_UNLOCK
>     AtInplace_Inval
>     END_CRIT_SECTION
>     UnlockTuple
>     AcceptInvalidationMessages
>
> Looking at inplace160, it seems that the start of the critical section is right after PreInplace_Inval. So why not pull START_CRIT_SECTION and END_CRIT_SECTION out to the genam.c layer?

heap_inplace_update() has an elog(ERROR) that needs to happen outside any critical section.  Since the condition for that elog deals with tuple header internals, it belongs at the heapam layer more than the systable layer.

> Alternatively since heap_inplace_update is commented as a subroutine of systable_inplace_update_finish, should everything just be moved to the genam.c layer? Although it looks like you already considered and rejected this approach.

Calling XLogInsert(RM_HEAP_ID) in genam.c would be a worse modularity violation than the one that led to the changes between v8 and v9.  I think even calling CacheInvalidateHeapTuple() in genam.c would be a worse modularity violation than the one attributed to v8.  Modularity would have the heap_inplace function resemble heap_update() handling of invals.

> If the above alternatives are not possible, it's probably fine to go ahead with the current patch with the function renamed to heap_inplace_update_and_unlock (or something similar) as mentioned earlier?

I like that name.  The next version will use it.

> > > I see that all the places we check resultRelInfo->ri_needLockTagTuple, we can just call IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a big advantage of storing a separate bool field? Also there is another
> >
> > No, not a big advantage.  I felt it was more in line with the typical style of src/backend/executor.
>
> I just find it a smell that there are two fields which are not independent and they go out of sync. And that's why my preference is to not have a dependent field unless there is a specific advantage.

Got it.  This check happens for every tuple of every UPDATE, so performance may be a factor.  Some designs and their merits:

==== a. ri_needLockTagTuple
Performance: best: check one value for nonzero
Drawback: one more value lifecycle to understand
Drawback: users of ResultRelInfo w/o InitResultRelInfo() could miss this

==== b. call IsInplaceUpdateRelation
Performance: worst: two extern function calls, then compare against two values

==== c. make IsInplaceUpdateRelation() and IsInplaceUpdateOid() inline, and call
Performance: high: compare against two values
Drawback: unlike catalog.c peers
Drawback: extensions that call these must recompile if these change

==== d. add IsInplaceUpdateRelationInline() and IsInplaceUpdateOidInline(), and call
Performance: high: compare against two values
Drawback: more symbols to understand
Drawback: extensions might call these, reaching the drawback of (c)

I think my preference order is (a), (c), (d), (b).  How do you see it?

> But even for the ri_TrigDesc, CatalogOpenIndexes() still sets it to NULL. So shouldn't ri_needLockTagTuple also be set to a default value of false?

CatalogOpenIndexes() explicitly zero-initializes two fields and relies on makeNode() zeroing for dozens of others.  Hence, omitting the initialization fits the function's local convention better than including it.  (PostgreSQL has no policy or dominant practice about redundant zero-initialization.)
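[Editor's note: to make option (c) more concrete, here is a rough sketch of what inlining could look like, assuming the checks stay equivalent to the current catalog.c behavior, where pg_class and pg_database are the only catalogs whose rows receive in-place updates. The header placement and exact shape are illustrative only, not taken from the patches.]

/* Hypothetical inline variants of the catalog.c helpers, for option (c). */
#include "postgres.h"

#include "catalog/pg_class_d.h"
#include "catalog/pg_database_d.h"
#include "utils/rel.h"

static inline bool
IsInplaceUpdateOid(Oid relid)
{
    /* Only pg_class and pg_database rows ever receive in-place updates. */
    return (relid == RelationRelationId ||
            relid == DatabaseRelationId);
}

static inline bool
IsInplaceUpdateRelation(Relation relation)
{
    return IsInplaceUpdateOid(RelationGetRelid(relation));
}

The drawback listed above still applies: an extension calling these would compile the OID comparison into itself and would need a rebuild if the list ever changed.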
On Wed, Sep 4, 2024 at 2:53 AM Noah Misch <noah@leadboat.com> wrote:
>
> > So this also pulls the invalidation to the genam.c layer. Am I understanding this correctly?
>
> Compared to the v9 patch, the "call both" alternative would just move the systable_endscan() call to a new systable_inplace_update_end().  It wouldn't move anything across the genam:heapam boundary.  systable_inplace_update_finish() would remain a thin wrapper around a heapam function.

Thanks for the clarification.

> > > > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > > > >   tolerable now, this gets less attractive after the inplace160 patch from
> > > > >   https://postgr.es/m/flat/20240523000548.58.nmisch@google.com
> > > >
> > > > I skimmed through the inplace160 patch. It wasn't clear to me why this becomes less attractive with the patch. I see there is a new CacheInvalidateHeapTupleInPlace but that looks like it would be called while holding the lock. And then there is an AcceptInvalidationMessages which can perhaps be moved to the genam.c layer too. Is the concern that one invalidation call will be in the heapam layer and the other will be in the genam layer?
> > >
> > > That, or a critical section would start in heapam.c, then end in genam.c.  Current call tree at inplace160 v4:
> >
> > How about this alternative?
> >
> > genam.c:systable_inplace_update_finish
> >     PreInplace_Inval
> >     START_CRIT_SECTION
> >     heapam.c:heap_inplace_update
> >         BUFFER_LOCK_UNLOCK
> >     AtInplace_Inval
> >     END_CRIT_SECTION
> >     UnlockTuple
> >     AcceptInvalidationMessages
> >
> > Looking at inplace160, it seems that the start of the critical section is right after PreInplace_Inval. So why not pull START_CRIT_SECTION and END_CRIT_SECTION out to the genam.c layer?
>
> heap_inplace_update() has an elog(ERROR) that needs to happen outside any critical section.  Since the condition for that elog deals with tuple header internals, it belongs at the heapam layer more than the systable layer.

Understood. How about this alternative then? The tuple length check and the elog(ERROR) get their own function, something like heap_inplace_update_validate or heap_inplace_update_validate_tuple_length. In that case, it would look like this:

genam.c:systable_inplace_update_finish
    heapam.c:heap_inplace_update_validate / heap_inplace_update_precheck
    PreInplace_Inval
    START_CRIT_SECTION
    heapam.c:heap_inplace_update
        BUFFER_LOCK_UNLOCK
    AtInplace_Inval
    END_CRIT_SECTION
    UnlockTuple
    AcceptInvalidationMessages

This is starting to get complicated though, so I don't have any issues with just renaming heap_inplace_update to heap_inplace_update_and_unlock.

> > Alternatively since heap_inplace_update is commented as a subroutine of systable_inplace_update_finish, should everything just be moved to the genam.c layer? Although it looks like you already considered and rejected this approach.
>
> Calling XLogInsert(RM_HEAP_ID) in genam.c would be a worse modularity violation than the one that led to the changes between v8 and v9.  I think even calling CacheInvalidateHeapTuple() in genam.c would be a worse modularity violation than the one attributed to v8.  Modularity would have the heap_inplace function resemble heap_update() handling of invals.

Understood. Thanks.

> > If the above alternatives are not possible, it's probably fine to go ahead with the current patch with the function renamed to heap_inplace_update_and_unlock (or something similar) as mentioned earlier?
>
> I like that name.  The next version will use it.

So either we go with this, or try the above approach of having a separate function _validate/_precheck/_validate_tuple_length. I don't have a strong opinion on either of these approaches.

> > > > I see that all the places we check resultRelInfo->ri_needLockTagTuple, we can just call IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a big advantage of storing a separate bool field? Also there is another
> > >
> > > No, not a big advantage.  I felt it was more in line with the typical style of src/backend/executor.
> >
> > I just find it a smell that there are two fields which are not independent and they go out of sync. And that's why my preference is to not have a dependent field unless there is a specific advantage.
>
> Got it.  This check happens for every tuple of every UPDATE, so performance may be a factor.  Some designs and their merits:

Thanks. If performance is a factor, it makes sense to keep it.

> ==== a. ri_needLockTagTuple
> Performance: best: check one value for nonzero
> Drawback: one more value lifecycle to understand
> Drawback: users of ResultRelInfo w/o InitResultRelInfo() could miss this
>
> ==== b. call IsInplaceUpdateRelation
> Performance: worst: two extern function calls, then compare against two values
>
> ==== c. make IsInplaceUpdateRelation() and IsInplaceUpdateOid() inline, and call
> Performance: high: compare against two values
> Drawback: unlike catalog.c peers
> Drawback: extensions that call these must recompile if these change
>
> ==== d. add IsInplaceUpdateRelationInline() and IsInplaceUpdateOidInline(), and call
> Performance: high: compare against two values
> Drawback: more symbols to understand
> Drawback: extensions might call these, reaching the drawback of (c)
>
> I think my preference order is (a), (c), (d), (b).  How do you see it?

My preference order would be the same. In general I like (c) more than (a), but recompiling extensions sounds like a major drawback, so here the preference is (a).

Can we do (a) along with some extra checks? To elaborate, execMain.c has a function called CheckValidResultRel which is called by ExecInitModifyTable in nodeModifyTable.c. Can we add the following assert (or just a debug assert) in this function?

Assert(rel->ri_needLockTagTuple == IsInplaceUpdateRelation(rel->ri_RelationDesc));

This can safeguard against users of ResultRelInfo missing this field. An alternative might be to only do debug assertions in the functions which use the field. But it seems simpler to just do it once in ExecInitModifyTable.

> > But even for the ri_TrigDesc, CatalogOpenIndexes() still sets it to NULL. So shouldn't ri_needLockTagTuple also be set to a default value of false?
>
> CatalogOpenIndexes() explicitly zero-initializes two fields and relies on makeNode() zeroing for dozens of others.  Hence, omitting the initialization fits the function's local convention better than including it.  (PostgreSQL has no policy or dominant practice about redundant zero-initialization.)

Thanks. Makes sense.

Thanks & Regards
Nitin Motiani
Google
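[Editor's note: spelled out as compilable code, the suggested safeguard could look like the sketch below. Whether it belongs in CheckValidResultRel() or directly in ExecInitModifyTable() is exactly what is left open above, so the helper and its placement are hypothetical.]

#include "postgres.h"

#include "catalog/catalog.h"
#include "nodes/execnodes.h"

/*
 * Hypothetical helper: cross-check the cached flag against the
 * authoritative catalog.c test, so an assert-enabled build catches any
 * ResultRelInfo that was built without going through InitResultRelInfo().
 */
static void
AssertLockTagTupleFlagMatches(const ResultRelInfo *resultRelInfo)
{
    Assert(resultRelInfo->ri_needLockTagTuple ==
           IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc));
}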
On Thu, Sep 5, 2024 at 1:27 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > How about this alternative then? The tuple length check and the elog(ERROR) get their own function, something like heap_inplace_update_validate or heap_inplace_update_validate_tuple_length. In that case, it would look like this:
> >
> > genam.c:systable_inplace_update_finish
> >     heapam.c:heap_inplace_update_validate / heap_inplace_update_precheck
> >     PreInplace_Inval
> >     START_CRIT_SECTION
> >     heapam.c:heap_inplace_update
> >         BUFFER_LOCK_UNLOCK
> >     AtInplace_Inval
> >     END_CRIT_SECTION
> >     UnlockTuple
> >     AcceptInvalidationMessages
> >
> > This is starting to get complicated though, so I don't have any issues with just renaming heap_inplace_update to heap_inplace_update_and_unlock.
>
> Complexity aside, I don't see the _precheck design qualifying as a modularity improvement.
>
> > Assert(rel->ri_needLockTagTuple == IsInplaceUpdateRelation(rel->ri_RelationDesc));
> >
> > This can safeguard against users of ResultRelInfo missing this field.
>
> v10 does the rename and adds that assertion.  This question remains open:

Looks good. A couple of minor comments:

1. In the inplace110 commit message, there are still references to heap_inplace_update. Should it be clarified that the function has been renamed?

2. Should there be a comment above the ri_needLockTagTuple definition in execnodes.h saying that we are caching this value to avoid function calls to IsInplaceUpdateRelation for every tuple? Similar to how the comment above ri_TrigFunctions mentions that it is cached lookup info.

> On Thu, Aug 22, 2024 at 12:32:00AM -0700, Noah Misch wrote:
> > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > How many of those are for RELKIND_INDEX vs. tables? I'm wondering if we should always require a tuple lock on indexes, and whether that would make a difference.
> >
> > Three sites.  See attached inplace125 patch.  Is it a net improvement?  If so, I'll squash it into inplace120.
>
> If nobody has an opinion, I'll discard inplace125.  I feel it's not a net improvement, but either way is fine with me.

It seems moderately simpler to me, but there is still special handling for RELKIND_INDEX; instead of doing it in systable_inplace_update_begin, we have a special case in heap_update. So overall it's only a small improvement, and I'm fine either way.

Thanks & Regards
Nitin Motiani
Google
On Thu, Sep 05, 2024 at 07:10:04PM +0530, Nitin Motiani wrote:
> On Thu, Sep 5, 2024 at 1:27 AM Noah Misch <noah@leadboat.com> wrote:
> > On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > > Assert(rel->ri_needLockTagTuple == IsInplaceUpdateRelation(rel->ri_RelationDesc));
> > >
> > > This can safeguard against users of ResultRelInfo missing this field.
> >
> > v10 does the rename and adds that assertion.  This question remains open:
>
> Looks good. A couple of minor comments:
>
> 1. In the inplace110 commit message, there are still references to heap_inplace_update. Should it be clarified that the function has been renamed?

PGXN has only one caller of this function, so I think that wouldn't help readers enough.  If someone gets a compiler error about the old name, they'll figure it out without commit log guidance.  If a person doesn't get a compiler error, they didn't need to read about the fact of the rename.

> 2. Should there be a comment above the ri_needLockTagTuple definition in execnodes.h saying that we are caching this value to avoid function calls to IsInplaceUpdateRelation for every tuple? Similar to how the comment above ri_TrigFunctions mentions that it is cached lookup info.

Current comment:

    /* updates do LockTuple() before oldtup read; see README.tuplock */
    bool        ri_needLockTagTuple;

Once the comment doesn't fit on one line, pgindent rules make it take a minimum of four lines.  I don't think words about avoiding function calls would add enough value to justify the vertical space, because a person starting to remove the field would see where it's called.  That's not to say the addition would be negligent.  If someone else were writing the patch and had included that, I wouldn't be deleting the material.
On Fri, Sep 6, 2024 at 3:34 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Thu, Sep 05, 2024 at 07:10:04PM +0530, Nitin Motiani wrote:
> > On Thu, Sep 5, 2024 at 1:27 AM Noah Misch <noah@leadboat.com> wrote:
> > > On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > > > Assert(rel->ri_needLockTagTuple == IsInplaceUpdateRelation(rel->ri_RelationDesc));
> > > >
> > > > This can safeguard against users of ResultRelInfo missing this field.
> > >
> > > v10 does the rename and adds that assertion.  This question remains open:
> >
> > Looks good. A couple of minor comments:
> >
> > 1. In the inplace110 commit message, there are still references to heap_inplace_update. Should it be clarified that the function has been renamed?
>
> PGXN has only one caller of this function, so I think that wouldn't help readers enough.  If someone gets a compiler error about the old name, they'll figure it out without commit log guidance.  If a person doesn't get a compiler error, they didn't need to read about the fact of the rename.
>
> > 2. Should there be a comment above the ri_needLockTagTuple definition in execnodes.h saying that we are caching this value to avoid function calls to IsInplaceUpdateRelation for every tuple? Similar to how the comment above ri_TrigFunctions mentions that it is cached lookup info.
>
> Current comment:
>
>     /* updates do LockTuple() before oldtup read; see README.tuplock */
>     bool        ri_needLockTagTuple;
>
> Once the comment doesn't fit on one line, pgindent rules make it take a minimum of four lines.  I don't think words about avoiding function calls would add enough value to justify the vertical space, because a person starting to remove the field would see where it's called.  That's not to say the addition would be negligent.  If someone else were writing the patch and had included that, I wouldn't be deleting the material.

Thanks. I have no other comments.
On Fri, Sep 06, 2024 at 03:22:48PM +0530, Nitin Motiani wrote:
> Thanks. I have no other comments.

https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".  When someone moves it to status="Ready for Committer", I will commit inplace090, inplace110, and inplace120 patches.  If one of you is comfortable with that, please modify the status.
On Sat, Sep 7, 2024 at 12:25 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Fri, Sep 06, 2024 at 03:22:48PM +0530, Nitin Motiani wrote:
> > Thanks. I have no other comments.
>
> https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".  When someone moves it to status="Ready for Committer", I will commit inplace090, inplace110, and inplace120 patches.  If one of you is comfortable with that, please modify the status.

Done.
On Thu, Sep 19, 2024 at 02:33:46PM -0700, Noah Misch wrote:
> On Mon, Sep 09, 2024 at 10:55:32AM +0530, Nitin Motiani wrote:
> > On Sat, Sep 7, 2024 at 12:25 AM Noah Misch <noah@leadboat.com> wrote:
> > > https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".  When someone moves it to status="Ready for Committer", I will commit inplace090, inplace110, and inplace120 patches.  If one of you is comfortable with that, please modify the status.
> >
> > Done.
>
> FYI, here are the branch-specific patches.  I plan to push these after the v17 release freeze lifts next week.

Pushed, but the pushes contained at least one defect:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=akepa&dt=2024-09-24%2022%3A29%3A02

I will act on that and other buildfarm failures that show up.
On Thu, Aug 22, 2024 at 12:32:00AM -0700, Noah Misch wrote:
> This move also loses the optimization of unpinning before XactLockTableWait().  heap_update() doesn't optimize that way, so that's fine.

In this other thread, I'm proposing to go back to unpinning: https://postgr.es/m/20241027214035.8a.nmisch@google.com
Hi,

On 2024-05-12 16:29:23 -0700, Noah Misch wrote:
> Author: Noah Misch <noah@leadboat.com>
> Commit: Noah Misch <noah@leadboat.com>
>
>     Make TAP todo_start effects the same under Meson and prove_check.
>
>     This could have caused spurious failures only on SPARC Linux, because
>     today's only todo_start tests for that platform.  Back-patch to v16,
>     where Meson support first appeared.
>
>     Reviewed by FIXME.
>
>     Discussion: https://postgr.es/m/FIXME
>
> diff --git a/src/tools/testwrap b/src/tools/testwrap
> index d01e610..9a270be 100755
> --- a/src/tools/testwrap
> +++ b/src/tools/testwrap
> @@ -41,12 +41,22 @@ env_dict = {**os.environ,
>              'TESTDATADIR': os.path.join(testdir, 'data'),
>              'TESTLOGDIR': os.path.join(testdir, 'log')}
>
> -sp = subprocess.run(args.test_command, env=env_dict)
> +sp = subprocess.Popen(args.test_command, env=env_dict, stdout=subprocess.PIPE)
> +# Meson categorizes a passing TODO test point as bad
> +# (https://github.com/mesonbuild/meson/issues/13183).  Remove the TODO
> +# directive, so Meson computes the file result like Perl does.  This could
> +# have the side effect of delaying stdout lines relative to stderr.  That
> +# doesn't affect the log file, and the TAP protocol uses stdout only.
> +for line in sp.stdout:
> +    if line.startswith(b'ok '):
> +        line = line.replace(b' # TODO ', b' # testwrap-overridden-TODO ', 1)
> +    sys.stdout.buffer.write(line)
> +returncode = sp.wait()

This has the issue that it causes the testwrap output to be buffered, which makes running tests with `meson test -v <testname>` update the output less promptly, only updating whenever the output buffer is flushed.

That's not the end of the world, but it'd be nice to get the output more promptly again. It doesn't matter that much when running the tests normally, but it does if you run them under valgrind or such and just want to see the first failure, because it's going to take an hour to finish all tests...

The easiest fix is to just explicitly flush after each line, as in the attached.

Greetings,

Andres Freund
On Tue, Mar 18, 2025 at 03:03:52PM -0400, Andres Freund wrote:
> Subject: [PATCH v1] meson: Flush stdout in testwrap
>
> Otherwise the progress won't reliably be displayed during a test.
> ---
>  src/tools/testwrap | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/src/tools/testwrap b/src/tools/testwrap
> index 8ae8fb79ba7..21105146c9d 100755
> --- a/src/tools/testwrap
> +++ b/src/tools/testwrap
> @@ -61,6 +61,7 @@ for line in sp.stdout:
>      if line.startswith(b'ok '):
>          line = line.replace(b' # TODO ', b' # testwrap-overridden-TODO ', 1)
>      sys.stdout.buffer.write(line)
> +    sys.stdout.flush()
>  returncode = sp.wait()

Fine with me.
Hi,

On 2025-03-18 12:17:41 -0700, Noah Misch wrote:
> On Tue, Mar 18, 2025 at 03:03:52PM -0400, Andres Freund wrote:
> > Subject: [PATCH v1] meson: Flush stdout in testwrap
> >
> > Otherwise the progress won't reliably be displayed during a test.
> > ---
> >  src/tools/testwrap | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/src/tools/testwrap b/src/tools/testwrap
> > index 8ae8fb79ba7..21105146c9d 100755
> > --- a/src/tools/testwrap
> > +++ b/src/tools/testwrap
> > @@ -61,6 +61,7 @@ for line in sp.stdout:
> >      if line.startswith(b'ok '):
> >          line = line.replace(b' # TODO ', b' # testwrap-overridden-TODO ', 1)
> >      sys.stdout.buffer.write(line)
> > +    sys.stdout.flush()
> >  returncode = sp.wait()
>
> Fine with me.

Cool. Pushed.

Thanks,

Andres