Re: race condition in pg_class - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: race condition in pg_class
Date
Msg-id c2bc625f-2a79-4be0-a154-72b07c347703@iki.fi
Whole thread Raw
In response to Re: race condition in pg_class  (Noah Misch <noah@leadboat.com>)
Responses Re: race condition in pg_class
List pgsql-hackers
On 14/07/2024 20:48, Noah Misch wrote:
> I've pushed the two patches for your reports.  To placate cfbot, I'm attaching
> the remaining patches.

inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would 
be in order, it looks pretty random as it is. Something like:

/*
  * Tuple locks are currently only held for short durations within a
  * transaction. Check that we didn't forget to release one.
  */

inplace110-successors-v8.patch: Makes sense.

The README changes would be better as part of the third patch, as this 
patch doesn't actually do any of the new locking described in the 
README, and it fixes the "inplace update updates wrong tuple" bug even 
without those tuple locks.

> + * ... [any slow preparation not requiring oldtup] ...
> + * heap_inplace_update_scan([...], &tup, &inplace_state);
> + * if (!HeapTupleIsValid(tup))
> + *    elog(ERROR, [...]);
> + * ... [buffer is exclusive-locked; mutate "tup"] ...
> + * if (dirty)
> + *    heap_inplace_update_finish(inplace_state, tup);
> + * else
> + *    heap_inplace_update_cancel(inplace_state);

I wonder if the functions should be called "systable_*" and placed in 
genam.c rather than in heapam.c. The interface looks more like the 
existing systable functions. It feels like a modularity violation for a 
function in heapam.c to take an argument like "indexId", and call back 
into systable_* functions.

>     /*----------
>      * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
>      *
>      * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
>      * ["R" is a VACUUM tbl]
>      * D: vac_update_datfrozenid() -> systable_beginscan(pg_class)
>     * c * D: systable_getnext() returns pg_class tuple of tbl
>      * R: memcpy() into pg_class tuple of tbl
>      * D: raise pg_database.datfrozenxid, XLogInsert(), finish
>      * [crash]
>      * [recovery restores datfrozenxid w/o relfrozenxid]
>      */

Hmm, that's a tight race, but feels bad to leave it unfixed. One 
approach would be to modify the tuple on the buffer only after 
WAL-logging it. That way, D cannot read the updated value before it has 
been WAL logged. Just need to make sure that the change still gets 
included in the WAL record. Maybe something like:

if (RelationNeedsWAL(relation))
{
     /*
      * Make a temporary copy of the page that includes the change, in
      * case the a full-page image is logged
      */
     PGAlignedBlock tmppage;

     memcpy(tmppage.data, page, BLCKSZ);

     /* copy the tuple to the temporary copy */
     memcpy(...);

     XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
     XLogInsert();
}

/* copy the tuple to the buffer */
memcpy(...);


>   pg_class heap_inplace_update_scan() callers: before the call, acquire
>   LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
>   (VACUUM), or a mode with strictly more conflicts.  If the update targets a
>   row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
>   on the table.  Locking the index rel is optional.  (This allows VACUUM to
>   overwrite per-index pg_class while holding a lock on the table alone.)  We
>   could allow weaker locks, in which case the next paragraph would simply call
>   for stronger locks for its class of commands.  heap_inplace_update_scan()
>   acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for
>   ExclusiveLock, on each tuple it overwrites.
> 
>   pg_class heap_update() callers: before copying the tuple to modify, take a
>   lock that conflicts with at least one of those from the preceding paragraph.
>   SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE.
>   After heap_update(), release any LOCKTAG_TUPLE.  Most of these callers opt
>   to acquire just the LOCKTAG_RELATION.

These rules seem complicated. Phrasing this slightly differently, if I 
understand correctly: for a heap_update() caller, it's always sufficient 
to hold LOCKTAG_TUPLE, but if you happen to hold some other lock on the 
relation that conflicts with those mentioned in the first paragraph, 
then you can skip the LOCKTAG_TUPLE lock.

Could we just stipulate that you must always hold LOCKTAG_TUPLE when you 
call heap_update() on pg_class or pg_database? That'd make the rule simple.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: shveta malik
Date:
Subject: Re: Conflict detection and logging in logical replication
Next
From: Peter Eisentraut
Date:
Subject: Re: [PATCH] Add get_bytes() and set_bytes() functions