Thread: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Michail Nikolaev
Date:
Hello, hackers.

It seems like PG 14 works incorrectly with vacuum_defer_cleanup_age
(or with rows just not being cleaned up, I'm not sure) and SELECT FOR UPDATE + UPDATE.
I am not certain, but hot_standby_feedback is probably able to cause the
same issues.

Steps to reproduce:

1) Start Postgres like this:

      docker run -it -p 5432:5432 --name pg -e
POSTGRES_PASSWORD=postgres -e LANG=C.UTF-8 -d postgres:14.6  -c
vacuum_defer_cleanup_age=1000000

2) Prepare scheme:

      CREATE TABLE something_is_wrong_here (id bigserial PRIMARY KEY,
value numeric(15,4) DEFAULT 0 NOT NULL);
      INSERT INTO something_is_wrong_here (value) (SELECT 10000 from
generate_series(0, 100));

3) Prepare file for pgbench:

      BEGIN;

      SELECT omg.*
         FROM something_is_wrong_here AS omg
         ORDER BY random()
         LIMIT 1
         FOR UPDATE
      \gset

      UPDATE something_is_wrong_here SET value = :value + 1 WHERE id = :id;

      END;

4) Run pgbench:

      pgbench -c 50 -j 2 -n -f reproduce.bench 'port= 5432
host=localhost user=postgres dbname=postgres password=postgres' -T 100
-P 1

I was able to see the following set of errors (looks scary):

ERROR:  MultiXactId 30818104 has not been created yet -- apparent wraparound
ERROR:  could not open file "base/13757/16385.1" (target block
39591744): previous segment is only 24 blocks
ERROR:  attempted to lock invisible tuple
ERROR:  could not access status of transaction 38195704
DETAIL:  Could not open file "pg_subtrans/0246": No such file or directory.


Best regards,
Michail.



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Michail Nikolaev
Date:
Hello, Andres.

I apologize for the direct ping, but I think your snapshot scalability
work in PG14 could be related to the issue.

The TransactionIdRetreatedBy implementation looks correct... But with
txid_current=212195 I see errors like "could not access status of
transaction 58643736"...
So, maybe vacuum_defer_cleanup_age just highlights some special case
(something with "previous" xids on the left side of zero?)....

Thanks,
Michail.



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Matthias van de Meent
Date:


On Thu, 5 Jan 2023 at 14:12, Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
>
> Hello, hackers.
>
> It seems like PG 14 works incorrectly with vacuum_defer_cleanup_age
> (or with rows just not being cleaned up, I'm not sure) and SELECT FOR UPDATE + UPDATE.
> I am not certain, but hot_standby_feedback is probably able to cause the
> same issues.
>
> Steps to reproduce:
>
> [steps]
>
> I was able to see the following set of errors (looks scary):
>
> ERROR:  MultiXactId 30818104 has not been created yet -- apparent wraparound
> ERROR:  could not open file "base/13757/16385.1" (target block
> 39591744): previous segment is only 24 blocks

This looks quite suspicious too - it wants to access a block at 296GB of data, where only 196kB exist.

> ERROR:  attempted to lock invisible tuple
> ERROR:  could not access status of transaction 38195704
> DETAIL:  Could not open file "pg_subtrans/0246": No such file or directory.

I just saw two instances of this "attempted to lock invisible tuple" error for the 15.1 image (run on Docker in Ubuntu in WSL) with your reproducer script, so this does not seem to be specific to PG14 (.6).

And, after some vacuum and restarting the process, I got the following:

client 29 script 0 aborted in command 2 query 0: ERROR:  heap tid from index tuple (111,1) points past end of heap page line pointer array at offset 262 of block 1 in index "something_is_wrong_here_pkey"

There is indeed something wrong there; the page can't be read by pageinspect:

$ select get_raw_page('public.something_is_wrong_here', 111)::bytea;
ERROR:  invalid page in block 111 of relation base/5/16385

I don't have access to the v14 data anymore (I tried a restart, which dropped the data :-( ), but will retain my v15 instance for some time to help any debugging.

Kind regards,

Matthias van de Meent

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Peter Geoghegan
Date:
On Thu, Jan 5, 2023 at 1:49 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> client 29 script 0 aborted in command 2 query 0: ERROR:  heap tid from index tuple (111,1) points past end of heap
> page line pointer array at offset 262 of block 1 in index "something_is_wrong_here_pkey"

This particular error message is from the hardening added to Postgres
15 in commit e7428a99. So it's not surprising that Michail didn't see
the same error on 14.

-- 
Peter Geoghegan



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Peter Geoghegan
Date:
On Thu, Jan 5, 2023 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This particular error message is from the hardening added to Postgres
> 15 in commit e7428a99. So it's not surprising that Michail didn't see
> the same error on 14.

Reproduced this on HEAD locally (no docker), without any difficulty.

FWIW, I find that the "Assert(ItemIdIsNormal(lp));" at the top of
heap_lock_tuple() is the first thing that fails on my assert-enabled
build.

-- 
Peter Geoghegan



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Michail Nikolaev
Date:
Hello!

Thanks for checking the issue!

> FWIW, I find that the "Assert(ItemIdIsNormal(lp));" at the top of
> heap_lock_tuple() is the first thing that fails on my assert-enabled
> build.

Yes, the same for me:

         TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c",
Line: 4292, PID: 33416


> Reproduced this on HEAD locally (no docker), without any difficulty.

It is a little bit harder to reproduce without Docker in my case; I need
to adjust the number of connections and threads:

         pgbench -c 90 -j 8 -n -f reproduce.bench 'port= 5432
host=localhost user=postgres dbname=postgres password=postgres' -T
2000 -P 1


Best regards,
Michail.



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Michail Nikolaev
Date:
Hello.

The few things I have got so far:

1) It is not required to order by random() to reproduce the issue - it
could be done using queries like:

      BEGIN;
      SELECT omg.*
         FROM something_is_wrong_here AS omg
         ORDER BY value -- change is here
         LIMIT 1
         FOR UPDATE
      \gset


      UPDATE something_is_wrong_here SET value = :value + 1 WHERE id = :id;
      COMMIT;

But for some reason it is harder to reproduce without random() in my
case (typically I need to wait about a minute with 100 connections).

2) It is not an issue at table creation time. The issue is reproducible if
vacuum_defer_cleanup_age is set after the table is prepared.

3) To reproduce the issue, vacuum_defer_cleanup_age must flip the xid
past zero (i.e. be >= txid_current()).
And it is stable: for example, I am unable to reproduce it with the value
733, but 734 gives the error each time.
A single additional txid_current() call (after the data is filled) avoids
the crash... It looks like the first SELECT FOR UPDATE + UPDATE silently
poisons everything somehow.
You can use this psql script:

      DROP TABLE IF EXISTS something_is_wrong_here;

      CREATE TABLE something_is_wrong_here (id bigserial PRIMARY KEY,
value numeric(15,4) DEFAULT 0 NOT NULL);

      INSERT INTO something_is_wrong_here (value) (SELECT 10000 from
generate_series(0, 100));

      SELECT txid_current() \gset

      SELECT :txid_current + 1 as txid \gset

      ALTER SYSTEM SET vacuum_defer_cleanup_age TO :txid;
      SELECT pg_reload_conf();

I have attached some scripts in case someone wants to reproduce it.

Best regards,
Michail.

Attachment

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

Thomas, CCing you because of the 64bit xid representation aspect.


On 2023-01-06 00:39:44 +0300, Michail Nikolaev wrote:
> I apologize for the direct ping, but I think your snapshot scalability
> work in PG14 could be related to the issue.

Good call!


> The TransactionIdRetreatedBy implementation looks correct... But with
> txid_current=212195 I see errors like "could not access status of
> transaction 58643736"...
> So, maybe vacuum_defer_cleanup_age just highlights some special case
> (something with "previous" xids on the left side of zero?)....

I think the bug is close to TransactionIdRetreatedBy(). Arguably in
FullXidRelativeTo(). Or a more fundamental data representation issue with
64bit xids.

To explain, here's a trace to the bottom of GetSnapshotData() leading to the
problem:

In the case I'm looking at here, we end up with 720 for
        oldestfxid = FullXidRelativeTo(latest_completed, oldestxid);
and xmin is 255271, both correct.

Then, at the call to TransactionIdRetreatedBy():
        /* apply vacuum_defer_cleanup_age */
        def_vis_xid_data =
            TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age);

things start to be iffy. Because we retreat by vacuum_defer_cleanup_age, which
was set to txid_current() in scheme.sql, and the xmin above is that xid,
TransactionIdRetreatedBy() first ends up with 0. It then backtracks further to
the highest 32bit xid (i.e. 4294967295). So far so good.

We could obviously end up with values further in the past as well, if
vacuum_defer_cleanup_age were larger.


Things start to seriously go off the rails when we convert that 32bit xid to
64 bit with:
        def_vis_fxid = FullXidRelativeTo(latest_completed, def_vis_xid);
which returns {value = 18446744073709551615}, i.e. 0 - 1 in 64bit.


However, as 64bit xids are not supposed to wrap around, we're in trouble -
it's an xid *very* far into the future, allowing things to be pruned that
shouldn't be, because everything is below that.


I don't quite know how to best fix this. 4294967295 is the correct result for
TransactionIdRetreatedBy() in this case, and it'd not cause problems for
FullXidRelativeTo() if we actually had wrapped around the xid counter before
(since then we'd just represent it as a fxid "of the proper epoch").

Basically, in contrast to 32bit xids, 64bit xids cannot represent xids from
before the start of the universe, whereas with the modulo arithmetic of 32bit
that's not a real problem.


It's probably not too hard to fix specifically in this one place - we could
just clamp vacuum_defer_cleanup_age to be smaller than latest_completed, but
it strikes me as a somewhat larger issue for the 64bit xid infrastructure. I
suspect this might not be the only place running into problems with such
"before the universe" xids.


For a bit I was thinking that we should introduce the notion that a
FullTransactionId can be from the past. Specifically that when the upper 32bit
are all set, we treat the lower 32bit as a value from before xid 0 using the
normal 32bit xid arithmetic.  But it sucks to add overhead for that
everywhere.

It might be a bit more palatable to designate one individual value,
e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far
before the start of the universe an xid points to...


FullXidRelativeTo() did assert that the input 32bit xid is in a reasonable
range, but unfortunately didn't do a similar check for the 64bit case.


Greetings,

Andres Freund



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-01-07 21:06:06 +0300, Michail Nikolaev wrote:
> 2) It is not an issue at table creation time. The issue is reproducible if
> vacuum_defer_cleanup_age is set after the table is prepared.
> 
> 3) To reproduce the issue, vacuum_defer_cleanup_age must flip the xid
> past zero (i.e. be >= txid_current()).
> And it is stable: for example, I am unable to reproduce it with the value
> 733, but 734 gives the error each time.
> A single additional txid_current() call (after the data is filled) avoids
> the crash... It looks like the first SELECT FOR UPDATE + UPDATE silently
> poisons everything somehow.
> You can use this psql script:

FWIW, the concrete value of vacuum_defer_cleanup_age is not crucial for
encountering the problem. It needs to be a value that, when subtracted from
the xid that did the "INSERT INTO something_is_wrong_here", results in a
value <= 0.

Setting vacuum_defer_cleanup_age to a value much larger than the xid allows
the crash to be encountered repeatedly.

Greetings,

Andres Freund



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-01-07 16:29:23 -0800, Andres Freund wrote:
> It's probably not too hard to fix specifically in this one place - we could
> just clamp vacuum_defer_cleanup_age to be smaller than latest_completed, but
> it strikes me as a somewhat larger issue for the 64bit xid infrastructure. I
> suspect this might not be the only place running into problems with such
> "before the universe" xids.

I haven't found other problematic places in HEAD, but did end up finding a less
serious version of this bug in < 14: GetFullRecentGlobalXmin(). I did verify
that with vacuum_defer_cleanup_age set, GetFullRecentGlobalXmin() returns
values that look likely to cause problems. It's "just" used in gist, luckily.

It's hard to find places that do this kind of arithmetic; we traditionally
haven't had a helper for it, so it's open-coded in various ways.

    xidStopLimit = xidWrapLimit - 3000000;
    if (xidStopLimit < FirstNormalTransactionId)
        xidStopLimit -= FirstNormalTransactionId;

and oddly:
    xidVacLimit = oldest_datfrozenxid + autovacuum_freeze_max_age;
    if (xidVacLimit < FirstNormalTransactionId)
        xidVacLimit += FirstNormalTransactionId;

or (in < 14):

    RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
    if (!TransactionIdIsNormal(RecentGlobalXmin))
        RecentGlobalXmin = FirstNormalTransactionId;


The currently existing places I found, other than the ones in procarray.c,
luckily don't seem to convert the xids to 64bit xids.


> For a bit I was thinking that we should introduce the notion that a
> FullTransactionId can be from the past. Specifically that when the upper 32bit
> are all set, we treat the lower 32bit as a value from before xid 0 using the
> normal 32bit xid arithmetic.  But it sucks to add overhead for that
> everywhere.
> 
> It might be a bit more palatable to designate one individual value,
> e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far
> before the start of the universe an xid points to...

On IM Thomas suggested we could reserve the 2^32-1 epoch for invalid values. I
hacked up a patch that converts various fxid functions to inline functions
with such assertions, and it indeed quickly catches the problem this thread
reported, close to the source of the use.

One issue with that is that it'd reduce what can be input for the xid8
type. But it's hard to believe that'd be a real issue?


It's quite unfortunate that we don't have a test for vacuum_defer_cleanup_age
yet :(.

Greetings,

Andres Freund



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Matthias van de Meent
Date:
On Sun, 8 Jan 2023 at 04:09, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-01-07 16:29:23 -0800, Andres Freund wrote:
> > It's probably not too hard to fix specifically in this one place - we could
> > just clamp vacuum_defer_cleanup_age to be smaller than latest_completed, but
> > it strikes me as a somewhat larger issue for the 64bit xid infrastructure. I
> > suspect this might not be the only place running into problems with such
> > "before the universe" xids.
>
> I haven't found other problematic places in HEAD, but did end up finding a less
> serious version of this bug in < 14: GetFullRecentGlobalXmin(). I did verify
> that with vacuum_defer_cleanup_age set, GetFullRecentGlobalXmin() returns
> values that look likely to cause problems. It's "just" used in gist, luckily.
>
> It's hard to find places that do this kind of arithmetic; we traditionally
> haven't had a helper for it, so it's open-coded in various ways.
> [...]
>
> The currently existing places I found, other than the ones in procarray.c,
> luckily don't seem to convert the xids to 64bit xids.

That's good to know.

> > For a bit I was thinking that we should introduce the notion that a
> > FullTransactionId can be from the past. Specifically that when the upper 32bit
> > are all set, we treat the lower 32bit as a value from before xid 0 using the
> > normal 32bit xid arithmetic.  But it sucks to add overhead for that
> > everywhere.
> >
> > It might be a bit more palatable to designate one individual value,
> > e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far
> > before the start of the universe an xid points to...
>
> On IM Thomas suggested we could reserve the 2^32-1 epoch for invalid values. I
> hacked up a patch that converts various fxid functions to inline functions
> with such assertions, and it indeed quickly catches the problem this thread
> reported, close to the source of the use.

Wouldn't it be enough to only fix the constructions in
FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
not occur with the patch), and (optionally) bump the first XID
available for any cluster to (FirstNormalXid + 1) to retain the 'older
than any running transaction' property?

The change only fixes the issue for FullTransactionId, which IMO is OK
- I don't think we need to keep xid->xid8->xid symmetric in cases of
xid8 wraparound.

> > One issue with that is that it'd reduce what can be input for the xid8
> > type. But it's hard to believe that'd be a real issue?

Yes, it's unlikely anyone would ever hit that with our current WAL
format - we use 24 bytes/xid just to log its use, so we'd use at
most epoch 0x1000_0000 in unrealistic scenarios. In addition,
technically, we already have (3*2^32 - 3) "invalid" xid8 values that
can never be produced in FullXidRelativeTo - those few extra invalid
values don't matter much to me except "even more special casing".

Kind regards,

Matthias van de Meent.

Attachment

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

Robert, Mark, CCing you because of the amcheck aspect (see below).

On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote:
> On Sun, 8 Jan 2023 at 04:09, Andres Freund <andres@anarazel.de> wrote:
> > > For a bit I was thinking that we should introduce the notion that a
> > > FullTransactionId can be from the past. Specifically that when the upper 32bit
> > > are all set, we treat the lower 32bit as a value from before xid 0 using the
> > > normal 32bit xid arithmetic.  But it sucks to add overhead for that
> > > everywhere.
> > >
> > > It might be a bit more palatable to designate one individual value,
> > > e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far
> > > before the start of the universe an xid points to...
> >
> > On IM Thomas suggested we could reserve the 2^32-1 epoch for invalid values. I
> > hacked up a patch that converts various fxid functions to inline functions
> > with such assertions, and it indeed quickly catches the problem this thread
> > reported, close to the source of the use.
> 
> Wouldn't it be enough to only fix the constructions in
> FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
> not occur with the patch), and (optionally) bump the first XID
> available for any cluster to (FirstNormalXid + 1) to retain the 'older
> than any running transaction' property?

It's not too hard to fix in individual places, but I suspect that we'll
introduce the bug in future places without some more fundamental protection.

Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable value
in ComputeXidHorizons() and GetSnapshotData().

Fixing it in FullXidRelativeTo() doesn't seem quite right; while it's ok to
just return FirstNormalTransactionId in the case of vacuum_defer_cleanup_age,
it doesn't seem necessarily correct for other cases.


> The change only fixes the issue for FullTransactionId, which IMO is OK
> - I don't think we need to keep xid->xid8->xid symmetric in cases of
> xid8 wraparound.

I think we should keep that symmetric, it just gets too confusing / easy to
miss bugs otherwise.


> > One issue with that is that it'd reduce what can be input for the xid8
> > type. But it's hard to believe that'd be a real issue?
> 
> Yes, it's unlikely anyone would ever hit that with our current WAL
> format - we use 24 bytes/xid just to log its use, so we'd use at
> most epoch 0x1000_0000 in unrealistic scenarios. In addition,
> technically, we already have (3*2^32 - 3) "invalid" xid8 values that
> can never be produced in FullXidRelativeTo - those few extra invalid
> values don't matter much to me except "even more special casing".

Yep. The attached 0002 is a first implementation of this.

The new assertions found at least one bug in amcheck, and one further example
of the problem of representing past 32bit xids in 64bit:

1) Because ctx->next_xid is only set after the FullTransactionIdFromXidAndCtx()
call in update_cached_xid_range(), we end up using xid 0 (or an outdated value
in subsequent calls) to determine whether the epoch needs to be reduced.

2) One test generates an xid from the future (4026531839), which
causes epoch to wrap around (via the epoch--) in
FullTransactionIdFromXidAndCtx(). I've hackily fixed that by just representing
it as an xid from the future instead. But I'm not sure that's a good answer.


A different approach would be to represent fxids as *signed* 64bit
integers. That'd of course lose more range, but could represent everything
accurately, and would have a compatible on-disk representation on two's
complement platforms (all our platforms). I think the only place that'd need
special treatment is U64FromFullTransactionId() / its callers.  I think this
might be the most robust approach.

Greetings,

Andres Freund

Attachment

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Thomas Munro
Date:
On Tue, Jan 10, 2023 at 8:34 AM Andres Freund <andres@anarazel.de> wrote:
> A different approach would be to represent fxids as *signed* 64bit
> integers. That'd of course lose more range, but could represent everything
> accurately, and would have a compatible on-disk representation on two's
> complement platforms (all our platforms). I think the only place that'd need
> special treatment is U64FromFullTransactionId() / its callers.  I think this
> might be the most robust approach.

It does sound like an interesting approach; it means you are free to
retreat arbitrarily without ever thinking about it, and by the
arguments given (LSN space required to consume fxids) it's still
'enough'.  Essentially all these bugs are places where the author
already believed it worked that way.

(Two's complement is required in the C23 draft.)



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Mark Dilger
Date:

> On Jan 9, 2023, at 11:34 AM, Andres Freund <andres@anarazel.de> wrote:
>
> 1) Because ctx->next_xid is only set after the FullTransactionIdFromXidAndCtx()
> call in update_cached_xid_range(), we end up using xid 0 (or an outdated value
> in subsequent calls) to determine whether the epoch needs to be reduced.

Can you say a bit more about your analysis here, preferably with pointers to
the lines of code you are analyzing?  Does the problem exist in amcheck as
currently committed, or are you thinking about a problem that arises only
after applying your patch?  I'm a bit fuzzy on where xid 0 gets used.

Thanks

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-01-09 13:55:02 -0800, Mark Dilger wrote:
> > On Jan 9, 2023, at 11:34 AM, Andres Freund <andres@anarazel.de> wrote:
> > 
> > 1) Because ctx->next_xid is only set after the FullTransactionIdFromXidAndCtx()
> > call in update_cached_xid_range(), we end up using xid 0 (or an outdated value
> > in subsequent calls) to determine whether the epoch needs to be reduced.
> 
> Can you say a bit more about your analysis here, preferably with pointers to
> the lines of code you are analyzing?  Does the problem exist in amcheck as
> currently committed, or are you thinking about a problem that arises only
> after applying your patch?  I'm a bit fuzzy on where xid 0 gets used.

The problems exist in the code as currently committed. I'm not sure what
exactly the consequences are; the result is that oldest_fxid will be, at least
temporarily, bogus.

Consider the first call to update_cached_xid_range():

/*
 * Update our cached range of valid transaction IDs.
 */
static void
update_cached_xid_range(HeapCheckContext *ctx)
{
    /* Make cached copies */
    LWLockAcquire(XidGenLock, LW_SHARED);
    ctx->next_fxid = ShmemVariableCache->nextXid;
    ctx->oldest_xid = ShmemVariableCache->oldestXid;
    LWLockRelease(XidGenLock);

    /* And compute alternate versions of the same */
    ctx->oldest_fxid = FullTransactionIdFromXidAndCtx(ctx->oldest_xid, ctx);
    ctx->next_xid = XidFromFullTransactionId(ctx->next_fxid);
}

The problem is that the call to FullTransactionIdFromXidAndCtx() happens
before ctx->next_xid is assigned, even though FullTransactionIdFromXidAndCtx()
uses ctx->next_xid.

static FullTransactionId
FullTransactionIdFromXidAndCtx(TransactionId xid, const HeapCheckContext *ctx)
{
    uint32        epoch;

    if (!TransactionIdIsNormal(xid))
        return FullTransactionIdFromEpochAndXid(0, xid);
    epoch = EpochFromFullTransactionId(ctx->next_fxid);
    if (xid > ctx->next_xid)
        epoch--;
    return FullTransactionIdFromEpochAndXid(epoch, xid);
}

Because ctx->next_xid is 0, due to not having been set yet, "xid > ctx->next_xid"
will always be true, leading to epoch being reduced by one.

In the common case of there never having been an xid wraparound, we'll thus
underflow epoch, generating an xid far into the future.


The tests encounter the issue today. If you add
    Assert(TransactionIdIsValid(ctx->next_xid));
    Assert(FullTransactionIdIsValid(ctx->next_fxid));
early in FullTransactionIdFromXidAndCtx() it'll be hit in the
amcheck/pg_amcheck tests.

Greetings,

Andres Freund



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Mark Dilger
Date:

> On Jan 9, 2023, at 2:07 PM, Andres Freund <andres@anarazel.de> wrote:
>
> The tests encounter the issue today. If you add
> Assert(TransactionIdIsValid(ctx->next_xid));
> Assert(FullTransactionIdIsValid(ctx->next_fxid));
> early in FullTransactionIdFromXidAndCtx() it'll be hit in the
> amcheck/pg_amcheck tests.

Ok, I can confirm that.  I find the assertion

    Assert(epoch != (uint32)-1);

a bit simpler to reason about, but either way, I agree it is a bug.  Thanks for finding this.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Matthias van de Meent
Date:
On Mon, 9 Jan 2023 at 20:34, Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote:
> > Wouldn't it be enough to only fix the constructions in
> > FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
> > not occur with the patch), and (optionally) bump the first XID
> > available for any cluster to (FirstNormalXid + 1) to retain the 'older
> > than any running transaction' property?
>
> It's not too hard to fix in individual places, but I suspect that we'll
> introduce the bug in future places without some more fundamental protection.
>
> Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable value
> in ComputeXidHorizons() and GetSnapshotData().

I don't think that clamping the value with oldestXid (as seen in patch
0001, in GetSnapshotData) is right.
It would clamp the value relative to the oldest frozen xid of all
databases, which can be millions of transactions behind oldestXmin,
and thus severely skew the amount of transactions' changes you keep on
disk (that is, until oldestXid moves past 1000000).
A similar case can be made for the changes in ComputeXidHorizons - for
the described behaviour of vacuum_defer_cleanup_age we would need to
clamp the used offset separately for each of the fields in the horizon
result to retain all transaction data for the first 1 million
transactions, and the ones that may still see these transactions.

> Fixing it in FullXidRelativeTo() doesn't seem quite right; while it's ok to
> just return FirstNormalTransactionId in the case of vacuum_defer_cleanup_age,
> it doesn't seem necessarily correct for other cases.

Understood.

Kind regards,

Matthias van de Meent



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
> On Mon, 9 Jan 2023 at 20:34, Andres Freund <andres@anarazel.de> wrote:
> > On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote:
> > > Wouldn't it be enough to only fix the constructions in
> > > FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
> > > not occur with the patch), and (optionally) bump the first XID
> > > available for any cluster to (FirstNormalXid + 1) to retain the 'older
> > > than any running transaction' property?
> >
> > It's not too hard to fix in individual places, but I suspect that we'll
> > introduce the bug in future places without some more fundamental protection.
> >
> > Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable value
> > in ComputeXidHorizons() and GetSnapshotData().
> 
> I don't think that clamping the value with oldestXid (as seen in patch
> 0001, in GetSnapshotData) is right.

I agree that using oldestXid to clamp is problematic.


> It would clamp the value relative to the oldest frozen xid of all
> databases, which can be millions of transactions behind oldestXmin,
> and thus severely skew the amount of transactions' changes you keep on
> disk (that is, until oldestXid moves past 1000000).

What precisely do you mean with "skew" here? Do you just mean that it'd take a
long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
you might mean more than that?


I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
seems like a mighty invasive change to backpatch.


Greetings,

Andres Freund



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Matthias van de Meent
Date:
On Tue, 10 Jan 2023 at 20:14, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
> > On Mon, 9 Jan 2023 at 20:34, Andres Freund <andres@anarazel.de> wrote:
> > > It's not too hard to fix in individual places, but I suspect that we'll
> > > introduce the bug in future places without some more fundamental protection.
> > >
> > > Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable value
> > > in ComputeXidHorizons() and GetSnapshotData().
> >
> > I don't think that clamping the value with oldestXid (as seen in patch
> > 0001, in GetSnapshotData) is right.
>
> I agree that using oldestXid to clamp is problematic.
>
>
> > It would clamp the value relative to the oldest frozen xid of all
> > databases, which can be millions of transactions behind oldestXmin,
> > and thus severely skew the amount of transaction's changes you keep on
> > disk (that is, until oldestXid moves past 1000_000).
>
> What precisely do you mean with "skew" here? Do you just mean that it'd take a
> long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
> you might mean more than that?

h->oldest_considered_running can be extremely old due to the global
nature of the value and the potential existence of a snapshot in
another database that started in parallel to a very old running
transaction.

Example: With vacuum_defer_cleanup_age set to 1000000, a snapshot in
another database (thus another backend) could produce intermediate
values of h->o_c_r = 20, h->s_o_n = 20, h->d_o_n = 10030. The offset
would then be clamped to 20 (using h->o_c_r), which updates
h->data_oldest_nonremovable to 10010. The obvious result is that all
but the last 20 transactions' changes in this database's data files
become available for cleanup, which contradicts the intention of the
vacuum_defer_cleanup_age GUC.

> I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
> seems like a mighty invasive change to backpatch.

I'm not sure either. Protecting against underflow by halving the
effective valid value space is quite the intervention, but if it is
necessary to make this work in a performant manner, it would be worth
it. Maybe someone else with more experience can provide their opinion
here.

Kind regards,

Matthias van de Meent



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Michail Nikolaev
Date:
Hello.

I have registered it as a patch in the commitfest:
https://commitfest.postgresql.org/42/4138/

Best regards,
Michail.



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-01-10 21:32:54 +0100, Matthias van de Meent wrote:
> On Tue, 10 Jan 2023 at 20:14, Andres Freund <andres@anarazel.de> wrote:
> > On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
> > What precisely do you mean with "skew" here? Do you just mean that it'd take a
> > long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
> > you might mean more than that?
> 
> h->oldest_considered_running can be extremely old due to the global
> nature of the value and the potential existence of a snapshot in
> another database that started in parallel to a very old running
> transaction.

Here's a version that, I think, does not have that issue.

In an earlier, not posted, version I had a vacuum_defer_cleanup_age-specific
helper function for this, but it seems likely we'll need it in other places
too.  So I named it TransactionIdRetreatSafely(). I made it accept the xid by
pointer, as the line lengths / repetition otherwise end up making it hard to
read the code.  For now I have TransactionIdRetreatSafely() be private to
procarray.c, but I expect we'll have to change that eventually.

Not sure I like TransactionIdRetreatSafely() as a name. Maybe
TransactionIdRetreatClamped() is better?


I've been working on a test for vacuum_defer_cleanup_age. It does catch the
corruption at hand, but not much more.  It's quite painful to write, right
now.  Some of the reasons:
https://postgr.es/m/20230130194350.zj5v467x4jgqt3d6%40awork3.anarazel.de



> > I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
> > seems like a mighty invasive change to backpatch.
> 
> I'm not sure either. Protecting against underflow by halving the
> effective valid value space is quite the intervention, but if it is
> necessary to make this work in a performant manner, it would be worth
> it. Maybe someone else with more experience can provide their opinion
> here.

The attached assertions just remove 1/2^32 of the space, by reserving
the xid range with the upper 32 bits set as something that shouldn't be
reachable.

Still requires us to change the input routines to reject that range, but I
think that's a worthy tradeoff.  I didn't find the existing limits for the
type to be documented anywhere.

Obviously something like that could only go into HEAD.

Greetings,

Andres Freund

Attachment

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Matthias van de Meent
Date:
On Mon, 30 Jan 2023 at 21:19, Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-10 21:32:54 +0100, Matthias van de Meent wrote:
> > On Tue, 10 Jan 2023 at 20:14, Andres Freund <andres@anarazel.de> wrote:
> > > On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
> > > What precisely do you mean with "skew" here? Do you just mean that it'd take a
> > > long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
> > > you might mean more than that?
> >
> > h->oldest_considered_running can be extremely old due to the global
> > nature of the value and the potential existence of a snapshot in
> > another database that started in parallel to a very old running
> > transaction.
>
> Here's a version that, I think, does not have that issue.

Thanks!

> In an earlier, not posted, version I had an vacuum_defer_cleanup_age specific
> helper function for this, but it seems likely we'll need it in other places
> too.  So I named it TransactionIdRetreatSafely(). I made it accept the xid by
> pointer, as the line lengths / repetition otherwise end up making it hard to
> read the code.  For now I have TransactionIdRetreatSafely() be private to
> procarray.c, but I expect we'll have to change that eventually.

If TransactionIdRetreatSafely will be exposed outside procarray.c,
then I think the xid pointer should be replaced with normal
arguments/returns; both for parity with TransactionIdRetreatedBy and
to remove this memory store dependency in this hot code path.

> Not sure I like TransactionIdRetreatSafely() as a name. Maybe
> TransactionIdRetreatClamped() is better?

I think the 'safely' version is fine.

> I've been working on a test for vacuum_defer_cleanup_age. It does catch the
> corruption at hand, but not much more.  It's quite painful to write, right
> now.  Some of the reasons:
> https://postgr.es/m/20230130194350.zj5v467x4jgqt3d6%40awork3.anarazel.de
>
>
>
> > > I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
> > > seems like a mighty invasive change to backpatch.
> >
> > I'm not sure either. Protecting against underflow by halving the
> > effective valid value space is quite the intervention, but if it is
> > necessary to make this work in a performant manner, it would be worth
> > it. Maybe someone else with more experience can provide their opinion
> > here.
>
> The attached assertions just removes 1/2**32'ths of the space, by reserving
> the xid range with the upper 32bit set as something that shouldn't be
> reachable.

I think that is acceptable.

> Still requires us to change the input routines to reject that range, but I
> think that's a worthy tradeoff.

Agreed.

> I didn't find the existing limits for the
> type to be documented anywhere.
>
> Obviously something like that could only go into HEAD.

Yeah.

Comments on 0003:

> +        /*
> +         * FIXME, doubtful this is the best fix.
> +         *
> +         * Can't represent the 32bit xid as a 64bit xid, as it's before fxid
> +         * 0. Represent it as an xid from the future instead.
> +         */
> +        if (epoch == 0)
> +            return FullTransactionIdFromEpochAndXid(0, xid);

Shouldn't this be an error condition instead, as this XID should not
be able to appear?

on 0004:

> -       '0xffffffffffffffff'::xid8,
> -       '-1'::xid8;
> +       '0xefffffffffffffff'::xid8,
> +       '0'::xid8;

The 0xFF... usages were replaced with "0xEFFF...". Shouldn't we also
test 0xffff_fffE_ffff_ffff, to cover input of our actual maximum
value?

> @@ -326,7 +329,11 @@ parse_snapshot(const char *str, Node *escontext)
>     while (*str != '\0')
>     {
>         /* read next value */
> -        val = FullTransactionIdFromU64(strtou64(str, &endp, 10));
> +        raw_fxid = strtou64(str, &endp, 10);
> +
> +        val = FullTransactionIdFromU64(raw_fxid);
> +        if (!InFullTransactionIdRange(raw_fxid))
> +            goto bad_format;

With assertions enabled FullTransactionIdFromU64 will assert the
InFullTransactionIdRange condition, meaning we wouldn't hit the branch
into bad_format.
I think these operations should be swapped, as parsing a snapshot
shouldn't run into assertion failures like this if it can error
normally. Maybe this can be added to tests as well?

Kind regards,

Matthias van de Meent



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-01-31 15:05:17 +0100, Matthias van de Meent wrote:
> On Mon, 30 Jan 2023 at 21:19, Andres Freund <andres@anarazel.de> wrote:
> > In an earlier, not posted, version I had an vacuum_defer_cleanup_age specific
> > helper function for this, but it seems likely we'll need it in other places
> > too.  So I named it TransactionIdRetreatSafely(). I made it accept the xid by
> > pointer, as the line lengths / repetition otherwise end up making it hard to
> > read the code.  For now I have TransactionIdRetreatSafely() be private to
> > procarray.c, but I expect we'll have to change that eventually.
> 
> If TransactionIdRetreatSafely will be exposed outside procarray.c,
> then I think the xid pointer should be replaced with normal
> arguments/returns; both for parity with TransactionIdRetreatedBy

That's why I named one version *Retreat* and the other *Retreated* :)

I think it'll make the code easier to read in the other places too; the
variable names / function names in this space are uncomfortably long to
fit into 78 chars, particularly when there are two references to the
same variable on the same line.


> and to remove this memory store dependency in this hot code path.

I doubt that matters much here and the places it's going to be used
in. And presumably the compiler will inline it anyway. I'd probably make
it a static inline in the header too.

What's making me hesitate about exposing it is that it's quite easy to
get things wrong by using a wrong fxid or such.


> > +        /*
> > +         * FIXME, doubtful this is the best fix.
> > +         *
> > +         * Can't represent the 32bit xid as a 64bit xid, as it's before fxid
> > +         * 0. Represent it as an xid from the future instead.
> > +         */
> > +        if (epoch == 0)
> > +            return FullTransactionIdFromEpochAndXid(0, xid);
> 
> Shouldn't this be an error condition instead, as this XID should not
> be able to appear?

If you mean error in the sense of ERROR, no, I don't think so. That code
tries hard to be able to check many tuples in a row, and if we were to
error out here, we'd not be able to do that. We should still report those
tuples as corrupt, fwiw.

The reason this path is hit is that a test intentionally corrupts some
xids. So the path is reachable and we need to cope somehow.

I'm not really satisfied with this fix either - I mostly wanted to
include something sufficient to prevent assertion failures.

I had hoped that Mark would look at the amcheck bits and come up with
more complete fixes.


> on 0004:
> 
> > -       '0xffffffffffffffff'::xid8,
> > -       '-1'::xid8;
> > +       '0xefffffffffffffff'::xid8,
> > +       '0'::xid8;
> 
> The 0xFF... usages were replaced with "0xEFFF...". Shouldn't we also
> test on 0xffff_fffE_ffff_ffff to test for input of our actual max
> value?

Probably a good idea.


> > @@ -326,7 +329,11 @@ parse_snapshot(const char *str, Node *escontext)
> >     while (*str != '\0')
> >     {
> >         /* read next value */
> > -        val = FullTransactionIdFromU64(strtou64(str, &endp, 10));
> > +        raw_fxid = strtou64(str, &endp, 10);
> > +
> > +        val = FullTransactionIdFromU64(raw_fxid);
> > +        if (!InFullTransactionIdRange(raw_fxid))
> > +            goto bad_format;
> 
> With assertions enabled FullTransactionIdFromU64 will assert the
> InFullTransactionIdRange condition, meaning we wouldn't hit the branch
> into bad_format.
> I think these operations should be swapped, as parsing a snapshot
> shouldn't run into assertion failures like this if it can error
> normally.

Yep.


> Maybe this can be added to tests as well?

I'll check. I thought for a bit it'd not work because we'd perform range
checks on the xids, but we don't...

Greetings,

Andres Freund



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Matthias van de Meent
Date:
On Tue, 31 Jan 2023 at 23:48, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-01-31 15:05:17 +0100, Matthias van de Meent wrote:
> > If TransactionIdRetreatSafely will be exposed outside procarray.c,
> > then I think the xid pointer should be replaced with normal
> > arguments/returns; both for parity with TransactionIdRetreatedBy
>
> That's why I named one version *Retreat the other Retreated :)

That part I noticed too :) I don't mind either way, I was just
concerned with exposing the function as a prototype, not as an inline
static.

> I think it'll make the code easier to read in the other places too, the
> variable names / function names in this space are uncomfortably long to
> fit into 78chars..., particularly when there's two references to the
> same variable in the same line.

I guess that's true, and once inlined there should indeed be no extra
runtime overhead.

> 78 chars
Didn't we use 80 columns/chars? How did you get to 78? Not that I
can't think of any ways, but none of them stand out to me as obviously
correct.

> > and to remove this memory store dependency in this hot code path.
>
> I doubt that matters much here and the places it's going to be used
> in.

I thought that this was executed while still holding ProcArrayLock, but
instead we've already released that lock by the time we're trying to
adjust the horizons, so the 'hot code path' concern is mostly
relieved.

> And presumably the compiler will inline it anyway. I'd probably make
> it a static inline in the header too.

Yes, my concern was based on an extern prototype with a private
implementation, as that prohibits inlining and thus would require
pushing the data to memory (probably only L1, but still
memory).

> What's making me hesitate about exposing it is that it's quite easy to
> get things wrong by using a wrong fxid or such.

I'm less concerned about that when the function is well-documented.

> > > +        /*
> > > +         * FIXME, doubtful this is the best fix.
> > > +         *
> > > +         * Can't represent the 32bit xid as a 64bit xid, as it's before fxid
> > > +         * 0. Represent it as an xid from the future instead.
> > > +         */
> > > +        if (epoch == 0)
> > > +            return FullTransactionIdFromEpochAndXid(0, xid);
> >
> > Shouldn't this be an error condition instead, as this XID should not
> > be able to appear?
>
> If you mean error in the sense of ERROR, no, I don't think so. That code
> tries hard to be able to check many tuples in a row. And if we were to
> error out here, we'd not able to do that. We should still report those
> tuples as corrupt, fwiw.
>
> The reason this path is hit is that a test intentionally corrupts some
> xids. So the path is reachable and we need to cope somehow.

I see.

Kind regards,

Matthias van de Meent



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

Heikki, Andrey, CCing you because you wrote

commit 6655a7299d835dea9e8e0ba69cc5284611b96f29
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2019-07-24 20:24:07 +0300

    Use full 64-bit XID for checking if a deleted GiST page is old enough.


On 2023-01-07 19:09:56 -0800, Andres Freund wrote:
> I haven't found other problematic places in HEAD, but did end up find a less
> serious version of this bug in < 14: GetFullRecentGlobalXmin(). I did verify
> that with vacuum_defer_cleanup_age set GetFullRecentGlobalXmin() returns
> values that look likely to cause problems. Its "just" used in gist luckily.

Is there a good way to make breakage in the page recycling mechanism
visible with gist? I guess to see corruption, I'd have to halt a scan
before a page is visited with gdb, then cause the page to be recycled
prematurely in another session, then unblock the first? Which'd then
visit that page, thinking it to be in a different part of the tree than
it actually is?

I'm pretty sure it's broken though.

On 13, with vacuum_defer_cleanup_age=0, two consecutive VACUUM VERBOSE
runs of the attached script output

106 index pages have been deleted, 0 are currently reusable.
106 index pages have been deleted, 0 are currently reusable.

in the presence of a prepared transaction. Which makes sense.

But with vacuum_defer_cleanup_age=10000

106 index pages have been deleted, 0 are currently reusable.
106 index pages have been deleted, 106 are currently reusable.


which clearly doesn't seem right.

I just can't quite judge how bad that is.

Greetings,

Andres Freund

Attachment

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Michail Nikolaev
Date:
Hello, Andres.

> Not sure I like TransactionIdRetreatSafely() as a name. Maybe
> TransactionIdRetreatClamped() is better?

I think it is better to just replace TransactionIdRetreatedBy outright:
it is not used anymore after
`v3-0001-WIP-Fix-corruption-due-to-vacuum_defer_cleanup_ag.patch`, and
replacing the dangerous version would help avoid similar issues in the
future. But we would also need to move FullXidRelativeTo in that case
(I'm not sure that is safe).

> I think it'll make the code easier to read in the other places too, the
> variable names / function names in this space are uncomfortably long to
> fit into 78chars..., particularly when there's two references to the
> same variable in the same line.

Looks fine to my taste, but it is pretty far from perfect :)

Best regards,
Michail.

Attachment

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andrey Borodin
Date:
On Sat, Feb 4, 2023 at 2:57 AM Andres Freund <andres@anarazel.de> wrote:
>
> Is there a good way to make breakage in the page recycling mechanism
> visible with gist? I guess to see corruption, I'd have to halt a scan
> before a page is visited with gdb, then cause the page to be recycled
> prematurely in another session, then unblock the first? Which'd then
> visit that page, thinking it to be in a different part of the tree than
> it actually is?
>

In most cases landing on one extra page will not affect the scan.
The worst case I can imagine is a scan landing on a page that is the
new parent of the deleted page. Even then we cannot end up with an
infinite index scan - we will just make one extra loop. The
IndexScan will, however, yield duplicate tids.

In the case of interference with a concurrent insertion we will get a
tree structure that departs from optimal, but that is not a problem.
Best regards, Andrey Borodin.



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Peter Geoghegan
Date:
On Sat, Feb 4, 2023 at 2:57 AM Andres Freund <andres@anarazel.de> wrote:
> Is there a good way to make breakage in the page recycling mechanism
> visible with gist? I guess to see corruption, I'd have to halt a scan
> before a page is visited with gdb, then cause the page to be recycled
> prematurely in another session, then unblock the first? Which'd then
> visit that page, thinking it to be in a different part of the tree than
> it actually is?

Yes. This bug is similar to an ancient nbtree bug fixed back in 2012,
by commit d3abbbeb.

> which clearly doesn't seem right.
>
> I just can't quite judge how bad that is.

It's really hard to judge, even if you're an expert. We're talking
about a fairly chaotic scenario. My guess is that there is a very
small chance of a very unpleasant scenario if you have a GiST index
that has regular page deletions, and if you use
vacuum_defer_cleanup_age. It's likely that most GiST indexes never
have any page deletions due to the workload characteristics.

-- 
Peter Geoghegan



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-02-04 11:10:55 -0800, Peter Geoghegan wrote:
> On Sat, Feb 4, 2023 at 2:57 AM Andres Freund <andres@anarazel.de> wrote:
> > Is there a good way to make breakage in the page recycling mechanism
> > visible with gist? I guess to see corruption, I'd have to halt a scan
> > before a page is visited with gdb, then cause the page to be recycled
> > prematurely in another session, then unblock the first? Which'd then
> > visit that page, thinking it to be in a different part of the tree than
> > it actually is?
> 
> Yes. This bug is similar to an ancient nbtree bug fixed back in 2012,
> by commit d3abbbeb.
> 
> > which clearly doesn't seem right.
> >
> > I just can't quite judge how bad that is.
> 
> It's really hard to judge, even if you're an expert. We're talking
> about a fairly chaotic scenario. My guess is that there is a very
> small chance of a very unpleasant scenario if you have a GiST index
> that has regular page deletions, and if you use
> vacuum_defer_cleanup_age. It's likely that most GiST indexes never
> have any page deletions due to the workload characteristics.

Thanks.


Sounds like the problem here is too hard to repro. I mostly wanted to know how
to be more confident about a fix working correctly. There are no tests for the
whole page recycling behaviour, afaics, so it's a bit scary to change things
around.

I didn't quite feel confident pushing a fix for this just before a minor
release, so I'll push once the minor releases are tagged. A quite minimal fix
to GetFullRecentGlobalXmin() in 12-13 (returning FirstNormalTransactionId if
epoch == 0 and RecentGlobalXmin > nextxid_xid), and the slightly larger fix in
14+.

Greetings,

Andres Freund



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-02-06 13:02:05 -0800, Andres Freund wrote:
> I didn't quite feel confident pushing a fix for this just before a minor
> release, so I'll push once the minor releases are tagged. A quite minimal fix
> to GetFullRecentGlobalXmin() in 12-13 (returning FirstNormalTransactionId if
> epoch == 0 and RecentGlobalXmin > nextxid_xid), and the slightly larger fix in
> 14+.

Pushed that.


Mark:

I worked some more on the fixes for amcheck, and on the tests for amcheck.

The second amcheck fix ends up correcting some inaccurate output in the tests
- previously xids from before xid 0 were reported to be in the future.

Previously there was no test case exercising an xid exceeding nextxid
without wrapping around into the past. I added one at the end of
004_verify_heapam.pl, because renumbering seemed too annoying.

What do you think?


Somewhat random note:

Is it intentional that the test runs VACUUM FREEZE ROWCOUNT times? That's
effectively O(ROWCOUNT^2), albeit with small enough constants to not really
matter. I don't think we need to insert the rows one-by-one either. Changing
that to a single INSERT and FREEZE shaves 10-12% off the tests.  I didn't
change that, but we also fire off a psql for each tuple for heap_page_items(),
with offset $N no less. That seems to be another 500ms.

Greetings,

Andres Freund

Attachment

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Mark Dilger
Date:

> On Mar 8, 2023, at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
>
> I worked some more on the fixes for amcheck, and on the tests for amcheck.
>
> The second amcheck fix ends up correcting some inaccurate output in the tests
> - previously xids from before xid 0 were reported to be in the future.
>
> Previously there was no test case exercising exceeding nextxid, without
> wrapping around into the past. I added that at the end of
> 004_verify_heapam.pl, because renumbering seemed too annoying.
>
> What do you think?

The changes look reasonable to me.

> Somewhat random note:
>
> Is it intentional that we VACUUM FREEZE test ROWCOUNT times? That's
> effectively O(ROWCOUNT^2), albeit with small enough constants to not really
> matter. I don't think we need to insert the rows one-by-one either. Changing
> that to a single INSERT and FREEZE shaves 10-12% off the tests.  I didn't
> change that, but we also fire off a psql for each tuple for heap_page_items(),
> with offset $N no less. That seems to be another 500ms.

I don't recall the reasoning.  Feel free to optimize the tests.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-03-09 12:15:16 -0800, Mark Dilger wrote:
> > Somewhat random note:
> > 
> > Is it intentional that we VACUUM FREEZE test ROWCOUNT times? That's
> > effectively O(ROWCOUNT^2), albeit with small enough constants to not really
> > matter. I don't think we need to insert the rows one-by-one either. Changing
> > that to a single INSERT and FREEZE shaves 10-12% off the tests.  I didn't
> > change that, but we also fire off a psql for each tuple for heap_page_items(),
> > with offset $N no less. That seems to be another 500ms.
> 
> I don't recall the reasoning.  Feel free to optimize the tests.

Something like the attached.

I don't know enough perl to know how to interpolate something like
use constant ROWCOUNT => 17;
so I just made it a variable.

Greetings,

Andres Freund

Attachment

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Mark Dilger
Date:

> On Mar 11, 2023, at 3:22 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Something like the attached.

I like that your patch doesn't make the test longer.  I assume you've already run the tests and that it works.

> I don't know enough perl to know how to interpolate something like
> use constant ROWCOUNT => 17;
> so I just made it a variable.

Seems fair.  I certainly don't mind.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-03-11 15:34:55 -0800, Mark Dilger wrote:
> > On Mar 11, 2023, at 3:22 PM, Andres Freund <andres@anarazel.de> wrote:
> > 
> > Something like the attached.
> 
> I like that your patch doesn't make the test longer.  I assume you've already run the tests and that it works.

I did check that, yes :). My process of writing perl is certainly, uh,
iterative. No way I would get anything close to working without testing it.

CI now finished the tests as well:
https://cirrus-ci.com/build/6675457702100992

So I'll go ahead and push that.

Greetings,

Andres Freund



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Peter Eisentraut
Date:
On 12.03.23 00:41, Andres Freund wrote:
> Hi,
> 
> On 2023-03-11 15:34:55 -0800, Mark Dilger wrote:
>>> On Mar 11, 2023, at 3:22 PM, Andres Freund <andres@anarazel.de> wrote:
>>>
>>> Something like the attached.
>>
>> I like that your patch doesn't make the test longer.  I assume you've already run the tests and that it works.
> 
> I did check that, yes :). My process of writing perl is certainly, uh,
> iterative. No way I would get anything close to working without testing it.
> 
> CI now finished the tests as well:
> https://cirrus-ci.com/build/6675457702100992
> 
> So I'll go ahead and push that.

There is a small issue with this commit (a4f23f9b3c).

In src/bin/pg_amcheck/t/004_verify_heapam.pl, there is code to detect
whether the page layout matches expectations and, if not, to call plan
skip_all.

This commit adds a test

is(scalar @lp_off, $ROWCOUNT, "acquired row offsets");

*before* that skip_all call.  This appears to be invalid.  If the 
skip_all happens, you get a complaint like

t/004_verify_heapam.pl (Wstat: 0 Tests: 1 Failed: 0)
   Parse errors: Bad plan.  You planned 0 tests but ran 1.

We could move the is() test after all the skip_all's.  Any thoughts?




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Andres Freund
Date:
Hi,

On 2023-05-10 17:44:07 +0200, Peter Eisentraut wrote:
> On 12.03.23 00:41, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-03-11 15:34:55 -0800, Mark Dilger wrote:
> > > > On Mar 11, 2023, at 3:22 PM, Andres Freund <andres@anarazel.de> wrote:
> > > > 
> > > > Something like the attached.
> > > 
> > > I like that your patch doesn't make the test longer.  I assume you've already run the tests and that it works.
> > 
> > I did check that, yes :). My process of writing perl is certainly, uh,
> > iterative. No way I would get anything close to working without testing it.
> > 
> > CI now finished the tests as well:
> > https://cirrus-ci.com/build/6675457702100992
> > 
> > So I'll go ahead and push that.
> 
> There is a small issue with this commit (a4f23f9b3c).
> 
> In src/bin/pg_amcheck/t/004_verify_heapam.pl, there is code to detect
> whether the page layout matches expectations and if not it calls plan
> skip_all.

Some of these skip_all's don't seem like a good idea. Why is a broken
relfrozenxid a cause for skipping a test? But anyway, that's really unrelated
to the topic at hand.


> This commit adds a test
> 
> is(scalar @lp_off, $ROWCOUNT, "acquired row offsets");
> 
> *before* that skip_all call.  This appears to be invalid.  If the skip_all
> happens, you get a complaint like
> 
> t/004_verify_heapam.pl (Wstat: 0 Tests: 1 Failed: 0)
>   Parse errors: Bad plan.  You planned 0 tests but ran 1.
> 
> We could move the is() test after all the skip_all's.  Any thoughts?

I think the easiest fix is to just die if we can't get the offsets - it's not
like we can really continue afterwards...

Greetings,

Andres Freund



Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From
Peter Eisentraut
Date:
On 10.05.23 20:04, Andres Freund wrote:
>> This commit adds a test
>>
>> is(scalar @lp_off, $ROWCOUNT, "acquired row offsets");
>>
>> *before* that skip_all call.  This appears to be invalid.  If the skip_all
>> happens, you get a complaint like
>>
>> t/004_verify_heapam.pl (Wstat: 0 Tests: 1 Failed: 0)
>>    Parse errors: Bad plan.  You planned 0 tests but ran 1.
>>
>> We could move the is() test after all the skip_all's.  Any thoughts?
> 
> I think the easiest fix is to just die if we can't get the offsets - it's not
> like we can really continue afterwards...

This should do it:

-is(scalar @lp_off, $ROWCOUNT, "acquired row offsets");
+scalar @lp_off == $ROWCOUNT or BAIL_OUT("row offset counts mismatch");

But I'm not sure what the latest thinking on BAIL_OUT is.  It is used 
nearby in a similar way though.