Thread: HOT chain validation in verify_heapam()

HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:
Hi,

On Mon, Apr 4, 2022 at 11:46 PM Andres Freund <andres@anarazel.de> wrote:

I think there's a few more things that'd be good to check. For example amcheck
doesn't verify that HOT chains are reasonable, which can often be spotted
looking at an individual page. Which is a bit unfortunate, given how many bugs
we had in that area.

Stuff to check around that:
- target of redirect has HEAP_ONLY_TUPLE, HEAP_UPDATED set
- In a valid ctid chain within a page (i.e. xmax = xmin):
  - tuples have HEAP_UPDATED set
  - HEAP_ONLY_TUPLE / HEAP_HOT_UPDATED matches across chain elements

(I changed the subject because the attached patch is related to HOT chain validation).

Please find attached a patch implementing the above idea of HOT chain validation (within a page), plus a few more validations as below. A rough sketch of these checks appears after the list.

* If the predecessor's xmin is aborted or in progress, the current tuple's xmin should be aborted or in progress respectively. In addition, both xmins must be equal.
* If the predecessor's xmin is not frozen, then the current tuple's shouldn't be either.
* If the predecessor's xmin is equal to the current tuple's xmin, the current tuple's cmin should be greater than the predecessor's cmin.
* If the current tuple is not HOT then its predecessor's tuple must not be HEAP_HOT_UPDATED.
* If the current tuple is HOT then its predecessor's tuple must be HEAP_HOT_UPDATED, and vice versa.
* If xmax is 0, which means it's the last tuple in the chain, then it must not be HEAP_HOT_UPDATED.
* If the current tuple is the last tuple in the HOT chain, then the next tuple should not be HOT.
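
In rough C terms, the core of these checks looks something like this (a sketch only, not the patch text; pred_htup/curr_htup and pred_xmin/curr_xmin are placeholder names for the predecessor's and current tuple's header and xmin):

    /* sketch: xmins must match if the predecessor's xmin aborted or is running */
    if ((TransactionIdDidAbort(pred_xmin) ||
         TransactionIdIsInProgress(pred_xmin)) &&
        !TransactionIdEquals(pred_xmin, curr_xmin))
        report_corruption(&ctx,
                          pstrdup("tuple with aborted or in-progress xmin was updated by a different transaction"));

    /* sketch: heap-only state must pair up across the chain link */
    if (HeapTupleHeaderIsHeapOnly(curr_htup) &&
        !HeapTupleHeaderIsHotUpdated(pred_htup))
        report_corruption(&ctx,
                          pstrdup("non-HOT update produced a heap-only tuple"));

    if (!HeapTupleHeaderIsHeapOnly(curr_htup) &&
        HeapTupleHeaderIsHotUpdated(pred_htup))
        report_corruption(&ctx,
                          pstrdup("HOT update produced a non-heap-only tuple"));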

I am looking into the process of adding TAP tests for these changes and finding a way to corrupt a page in the TAP test. I will try to include these test cases in an upcoming version of the patch.

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
Hi,

Thanks for working on this.

+                htup = (HeapTupleHeader) PageGetItem(ctx.page, rditem);
+                if (!(HeapTupleHeaderIsHeapOnly(htup) &&
+                      htup->t_infomask & HEAP_UPDATED))
+                    report_corruption(&ctx,
+                                      psprintf("Redirected Tuple at line pointer offset %u is not HEAP_ONLY_TUPLE or HEAP_UPDATED tuple",
+                                               (unsigned) rdoffnum));

This isn't safe because the line pointer referenced by rditem may not
have been sanity-checked yet. Refer to the comment just below where it
says "Sanity-check the line pointer's offset and length values".

There are multiple problems with this error message. First, if you
take a look at the existing messages - which is always a good thing to
do when adding new ones - you will see that they are capitalized
differently. Try to match the existing style. Second, we made a real
effort with the existing messages to avoid referring to the names of
identifiers that only exist at the C level. For example, just above
you will see a message which says "line pointer redirection to item at
offset %u precedes minimum offset %u". It deliberately does not say
"line pointer redirection to item at offset %u is less than
FirstOffsetNumber" even though that would be an equally correct
statement of the problem. The intent here is to make the messages at
least somewhat accessible to users who are somewhat familiar with how
PostgreSQL storage works but may not read the C code. These comments
apply to every message you add in the patch.

The message also does not match the code. The code tests for
HEAP_UPDATED, but the message claims that the code is testing for
either HEAP_ONLY_TUPLE or HEAP_UPDATED. As a general rule, it is best
not to club related tests together in cases like this, because it
enables better and more specific error messages.

It would be clearer to make an explicit comparison to 0, like
(htup->t_infomask & HEAP_UPDATED) != 0, rather than relying on 0 being
false and non-zero being true. It doesn't matter to the compiler, but
it may help human readers.
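
For example, the condition above could be phrased as:

    if (HeapTupleHeaderIsHeapOnly(htup) &&
        (htup->t_infomask & HEAP_UPDATED) != 0)
    {
        /* redirect target looks like a proper HOT tuple */
    }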

+            /*
+             * Add line pointer offset to predecessor array.
+             * 1) If xmax is matching with xmin of next tuple (reached via its t_ctid).
+             * 2) If next tuple is in the same page.
+             * Raise corruption if:
+             * We have two tuples having the same predecessor.
+             *
+             * We add offset to predecessor irrespective of transaction (t_xmin)
+             * status. We will do validation related to transaction status (and
+             * also all other validations) when we loop over predecessor array.
+             */

The formatting of this comment will, I think, be mangled if pgindent
is run against the file. You can use ----- markers to prevent that, I
believe, or (better) write this as a paragraph without relying on the
lines ending up uneven in length.
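
For example, something like:

    /*----------
     * pgindent leaves a comment formatted this way alone, so
     * hand-built lists survive a reindent.
     *----------
     */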

+                if (predecessor[nextTupOffnum] != 0)
+                {
+                    report_corruption(&ctx,
+                                      psprintf("Tuple at offset %u is reachable from two or more updated tuple",
+                                               (unsigned) nextTupOffnum));
+                    continue;
+                }

You need to do this test after xmin/xmax matching. Otherwise you might
get false positives. Also, the message should be more specific and
match the style of the existing messages. ctx.offnum is already going
to be reported in another column, but we can report both nextTupOffnum
and predecessor[nextTupOffnum] here e.g. "updated version at offset %u
is also the updated version of tuple at offset %u".

+                currTupXmax = HeapTupleHeaderGetUpdateXid(ctx.tuphdr);
+                lp   = PageGetItemId(ctx.page, nextTupOffnum);
+
+                htup = (HeapTupleHeader) PageGetItem(ctx.page, lp);

This has the same problem I mentioned in my first comment above,
namely, we haven't necessarily sanity-checked the length and offset
values for nextTupOffnum yet. Saying that another way, if the contents
of lp are corrupt and point off the page, we want that to be reported
as corruption (which the current code will already do) and we want
this check to be skipped so that we don't crash or access random
memory addresses. You need to think about how to rearrange the code so
that we only perform checks that are known to be safe.

+        /* Now loop over offset and validate data in predecessor array.*/
+        for ( ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
+             ctx.offnum = OffsetNumberNext(ctx.offnum))

Please take the time to format your code according to the PostgreSQL
standard practice. If you don't know what that looks like, use
pgindent.

+        {
+            HeapTupleHeader pred_htup, curr_htup;
+            TransactionId   pred_xmin, curr_xmin, curr_xmax;
+            ItemId      pred_lp, curr_lp;

Same here.

Within this loop, you need to think about what to include in the
columns of the output other than 'msg' and what to include in the
message itself. There's no reason to include ctx.offnum in the message
text because it's already included in the 'offnum' column of the
output.

I think it would actually be a good idea to set ctx.offnum to the
predecessor's offset number, and use a separate variable for the
current offset number. The reason why I think this is that I believe
it will make it easier to phrase the messages appropriately. For
example, if ctx.offnum is the predecessor tuple, then we can issue
complaints like this:

tuple with uncommitted xmin %u was updated to produce a tuple at offset %u with differing xmin %u
unfrozen tuple was updated to produce a tuple at offset %u which is not frozen
tuple with uncommitted xmin %u has cmin %u, but was updated to produce a tuple with cmin %u
non-heap-only update produced a heap-only tuple at offset %u
heap-only update produced a non-heap only tuple at offset %u

+            if (!TransactionIdIsValid(curr_xmax) &&
+                HeapTupleHeaderIsHotUpdated(curr_htup))
+            {
+                report_corruption(&ctx,
+                                  psprintf("Current tuple at offset %u is HOT but is last tuple in the HOT chain.",
+                                           (unsigned) ctx.offnum));
+            }

This check has nothing to do with the predecessor[] array, so it seems
like it belongs in check_tuple() rather than here. Also, the message
is rather confused, because the test is checking whether the tuple has
been HOT-updated, while the message is talking about whether the tuple
was *itself* created by a HOT update. Also, when we're dealing with
corruption, statements like "is last tuple in the HOT chain" are
pretty ambiguous. Also, isn't this an issue for both HOT-updated
tuples and also just regular updated tuples? i.e. maybe what we should
be complaining about here is something like "tuple has been updated,
but xmax is 0" and then make the test check exactly that.

+            if (!HeapTupleHeaderIsHotUpdated(pred_htup) &&
+                HeapTupleHeaderIsHeapOnly(pred_htup) &&
+                HeapTupleHeaderIsHeapOnly(curr_htup))
+            {
+                report_corruption(&ctx,
+                                  psprintf("Current tuple at offset %u is HOT but it is next updated tuple of last Tuple in HOT chain.",
+                                           (unsigned) ctx.offnum));
+            }

Three if-statements up, you tested two out of these three conditions
and complained if they were met. So any time this fires, that will
have also fired.

...Robert



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:

Hi Robert,

Thanks for sharing the feedback.

On Sat, Aug 27, 2022 at 1:47 AM Robert Haas <robertmhaas@gmail.com> wrote:
+                htup = (HeapTupleHeader) PageGetItem(ctx.page, rditem);
+                if (!(HeapTupleHeaderIsHeapOnly(htup) &&
+                      htup->t_infomask & HEAP_UPDATED))
+                    report_corruption(&ctx,
+                                      psprintf("Redirected Tuple at line pointer offset %u is not HEAP_ONLY_TUPLE or HEAP_UPDATED tuple",
+                                               (unsigned) rdoffnum));

This isn't safe because the line pointer referenced by rditem may not
have been sanity-checked yet. Refer to the comment just below where it
says "Sanity-check the line pointer's offset and length values".

Handled by creating a new function check_lp() and calling it before accessing the redirected tuple.


+            /*
+             * Add line pointer offset to predecessor array.
+             * 1) If xmax is matching with xmin of next tuple (reached via its t_ctid).
+             * 2) If next tuple is in the same page.
+             * Raise corruption if:
+             * We have two tuples having the same predecessor.
+             *
+             * We add offset to predecessor irrespective of transaction (t_xmin)
+             * status. We will do validation related to transaction status (and
+             * also all other validations) when we loop over predecessor array.
+             */

The formatting of this comment will, I think, be mangled if pgindent
is run against the file. You can use ----- markers to prevent that, I
believe, or (better) write this as a paragraph without relying on the
lines ending up uneven in length.

 
Done, reformatted using pgindent.

+                if (predecessor[nextTupOffnum] != 0)
+                {
+                    report_corruption(&ctx,
+                                      psprintf("Tuple at offset %u is reachable from two or more updated tuple",
+                                               (unsigned) nextTupOffnum));
+                    continue;
+                }

You need to do this test after xmin/xmax matching. Otherwise you might
get false positives. Also, the message should be more specific and
match the style of the existing messages. ctx.offnum is already going
to be reported in another column, but we can report both nextTupOffnum
and predecessor[nextTupOffnum] here e.g. "updated version at offset %u
is also the updated version of tuple at offset %u".

 
Agree, done.

+                currTupXmax = HeapTupleHeaderGetUpdateXid(ctx.tuphdr);
+                lp   = PageGetItemId(ctx.page, nextTupOffnum);
+
+                htup = (HeapTupleHeader) PageGetItem(ctx.page, lp);

This has the same problem I mentioned in my first comment above,
namely, we haven't necessarily sanity-checked the length and offset
values for nextTupOffnum yet. Saying that another way, if the contents
of lp are corrupt and point off the page, we want that to be reported
as corruption (which the current code will already do) and we want
this check to be skipped so that we don't crash or access random
memory addresses. You need to think about how to rearrange the code so
that we only perform checks that are known to be safe.

 
Moved the sanity-check logic to a new function check_lp(), which is called before accessing the next tuple while populating the predecessor array.
 
Please take the time to format your code according to the PostgreSQL
standard practice. If you don't know what that looks like, use
pgindent.

+        {
+            HeapTupleHeader pred_htup, curr_htup;
+            TransactionId   pred_xmin, curr_xmin, curr_xmax;
+            ItemId      pred_lp, curr_lp;

Same here.
 
Done. 

I think it would actually be a good idea to set ctx.offnum to the
predecessor's offset number, and use a separate variable for the
current offset number. The reason why I think this is that I believe
it will make it easier to phrase the messages appropriately. For
example, if ctx.offnum is the predecessor tuple, then we can issue
complaints like this:

tuple with uncommitted xmin %u was updated to produce a tuple at offset %u with differing xmin %u
unfrozen tuple was updated to produce a tuple at offset %u which is not frozen
tuple with uncommitted xmin %u has cmin %u, but was updated to produce a tuple with cmin %u
non-heap-only update produced a heap-only tuple at offset %u
heap-only update produced a non-heap only tuple at offset %u


Agree, done.
 
+            if (!TransactionIdIsValid(curr_xmax) &&
+                HeapTupleHeaderIsHotUpdated(curr_htup))
+            {
+                report_corruption(&ctx,
+                                  psprintf("Current tuple at offset %u is HOT but is last tuple in the HOT chain.",
+                                           (unsigned) ctx.offnum));
+            }

This check has nothing to do with the predecessor[] array, so it seems
like it belongs in check_tuple() rather than here. Also, the message
is rather confused, because the test is checking whether the tuple has
been HOT-updated, while the message is talking about whether the tuple
was *itself* created by a HOT update. Also, when we're dealing with
corruption, statements like "is last tuple in the HOT chain" are
pretty ambiguous. Also, isn't this an issue for both HOT-updated
tuples and also just regular updated tuples? i.e. maybe what we should
be complaining about here is something like "tuple has been updated,
but xmax is 0" and then make the test check exactly that.
 
Moved to check_tuple_header(). This should be applicable to both HOT and normal updates, but even the last tuple in a normal update chain is HEAP_UPDATED, so I am not sure how we can apply this check to a normal update.

+            if (!HeapTupleHeaderIsHotUpdated(pred_htup) &&
+                HeapTupleHeaderIsHeapOnly(pred_htup) &&
+                HeapTupleHeaderIsHeapOnly(curr_htup))
+            {
+                report_corruption(&ctx,
+                                  psprintf("Current tuple at offset %u is HOT but it is next updated tuple of last Tuple in HOT chain.",
+                                           (unsigned) ctx.offnum));
+            }

Three if-statements up, you tested two out of these three conditions
and complained if they were met. So any time this fires, that will
have also fired.

Yes, the above condition is not required. Now removed.

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi Himanshu,

Many thanks for working on this!

> Please find attached the patch with the above idea of HOT chain's validation

Please correct me if I'm wrong, but don't we have a race condition here:

```
+            if ((TransactionIdDidAbort(pred_xmin) ||
+                 TransactionIdIsInProgress(pred_xmin)) &&
+                !TransactionIdEquals(pred_xmin, curr_xmin))
             {
```

The scenario that concerns me is the following:

1. TransactionIdDidAbort(pred_xmin) returns false
2. The transaction aborts
3. TransactionIdIsInProgress(pred_xmin) returns false
4. (false || false) gives us false. An error is reported, although
actually the condition should have been true.

-- 
Best regards,
Aleksander Alekseev



Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi again,

> I am looking into the process of adding TAP tests for these changes and finding a way to corrupt a page in the TAP test.

Please note that currently the patch breaks many existing tests. I
suggest fixing these first.

For the details please see the cfbot report [1] or execute the tests
locally. Personally I'm using a little script for this [2].

[1]: http://cfbot.cputube.org/
[2]: https://github.com/afiskon/pgscripts/blob/master/full-build.sh

-- 
Best regards,
Aleksander Alekseev



Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi hackers,

> Please correct me if I'm wrong, but don't we have a race condition here:
>
> ```
> +            if ((TransactionIdDidAbort(pred_xmin) ||
> +                 TransactionIdIsInProgress(pred_xmin)) &&
> +                !TransactionIdEquals(pred_xmin, curr_xmin))
>              {
> ```
>
> The scenario that concerns me is the following:
>
> 1. TransactionIdDidAbort(pred_xmin) returns false
> 2. The transaction aborts
> 3. TransactionIdIsInProgress(pred_xmin) returns false
> 4. (false || false) gives us false. An error is reported, although
> actually the condition should have been true.

It looks like I had a slight brain fade here.

In order to report a false error either TransactionIdDidAbort() or
TransactionIdIsInProgress() should return true and
TransactionIdEquals() should be false. So actually under rare
conditions the error will NOT be reported while it should be. Other than
that we seem to be safe from the concurrency perspective, unless I'm
missing something again.

Personally I don't have a strong opinion on whether we should bother
about this scenario. Probably an explicit comment will not hurt.

Also I suggest checking TransactionIdEquals() first though since it's cheaper.

-- 
Best regards,
Aleksander Alekseev



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Tue, Sep 6, 2022 at 9:38 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> > Please correct me if I'm wrong, but don't we have a race condition here:
> >
> > ```
> > +            if ((TransactionIdDidAbort(pred_xmin) ||
> > +                 TransactionIdIsInProgress(pred_xmin)) &&
> > +                !TransactionIdEquals(pred_xmin, curr_xmin))
> >              {
> > ```
>
> It looks like I had a slight brain fade here.
>
> In order to report a false error either TransactionIdDidAbort() or
> TransactionIdIsInProgress() should return true and
> TransactionIdEquals() should be false. So actually under rare
> conditions the error will NOT be reported while it should. Other than
> that we seem to be safe from the concurrency perspective, unless I'm
> missing something again.
>
> Personally I don't have a strong opinion on whether we should bother
> about this scenario. Probably an explicit comment will not hurt.
>
> Also I suggest checking TransactionIdEquals() first though since it's cheaper.

I think the check should be written like this:

!TransactionIdEquals(pred_xmin, curr_xmin) && !TransactionIdDidCommit(pred_xmin)

The TransactionIdEquals check should be done first for the reason you
state: it's cheaper.

I think that we shouldn't be using TransactionIdDidAbort() at all,
because it can sometimes return false even when the transaction
actually did abort. See test_lockmode_for_conflict() and
TransactionIdIsInProgress() for examples of logic that copes with
this. What's happening here is that TransactionIdDidAbort doesn't ever
get called if the system crashes while a transaction is running. So we
can use TransactionIdDidAbort() only as an optimization: if it returns
true, the transaction is definitely aborted, but if it returns false,
we have to check whether it's still running. If not, it aborted
anyway.

TransactionIdDidCommit() does not have the same issue. A transaction
can abort without updating CLOG if the system crashes, but it can
never commit without updating CLOG. If the transaction didn't commit,
then it is either aborted or still in progress (and we don't care
which, because neither is an error here).

As to whether the existing formulation of the test has an error
condition, you're generally right that we should test
TransactionIdIsInProgress() before TransactionIdDidCommit/Abort,
because during commit or abort, we first set the status in CLOG -
which is queried by TransactionIdDidCommit/Abort - and only afterward
update the procarray - which is queried by TransactionIdIsInProgress.
So normally TransactionIdIsInProgress should be checked first, and
TransactionIdDidCommit/Abort should only be checked if it returns
false, at which point we know that the return values of the latter
calls can't ever change. Possibly there is an argument for including
the TransactionIdInProgress check here too:

!TransactionIdEquals(pred_xmin, curr_xmin) &&
(TransactionIdIsInProgress(pred_xmin) ||
!TransactionIdDidCommit(pred_xmin))

...but I don't think it could change the answer. Most places that
check TransactionIdIsInProgress() first are concerned with MVCC
semantics, and here we are not. I think the only effects of including
or excluding the TransactionIdIsInProgress() test are (1) performance,
in that searching the procarray might avoid expense if it's cheaper
than searching clog, or add expense if the reverse is true and (2)
slightly changing the time at which we're first able to detect this
form of corruption. I am inclined to prefer the simpler form of the
test without TransactionIdIsInProgress() unless someone can say why we
shouldn't go that route.
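
Expressed as a sketch (not patch text), the safe ordering is:

    if (TransactionIdIsInProgress(xid))
    {
        /* still running; CLOG has no final answer yet */
    }
    else if (TransactionIdDidCommit(xid))
    {
        /* definitely committed */
    }
    else
    {
        /*
         * Definitely not committed: either it aborted, or it crashed
         * before an abort record made it to CLOG. For a corruption
         * check the two cases are equivalent.
         */
    }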

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Tue, Sep 6, 2022 at 6:34 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
>> This isn't safe because the line pointer referenced by rditem may not
>> have been sanity-checked yet. Refer to the comment just below where it
>> says "Sanity-check the line pointer's offset and length values".
>>
> handled by creating a new function check_lp and calling it before accessing the redirected tuple.

I think this is going to result in duplicate error messages, because
if A redirects to B, what keeps us from calling check_lp(B) once when
we reach A and again when we reach B?

I am kind of generally suspicious of the idea that, both for redirects
and for ctid links, you just have it check_lp() on the target line
pointer and then maybe try to skip doing that again later when we get
there. That feels error-prone to me. I think we should try to find a
way of organizing the code where we do the check_lp() checks on all
line pointers in order without skipping around or backing up. It's not
100% clear to me what the best way of accomplishing that is, though.

But here's one random idea: add a successor[] array and an lp_valid[]
array. In the first loop, set lp_valid[offset] = true if it passes the
check_lp() checks, and set successor[A] = B if A redirects to B or has
a CTID link to B, without matching xmin/xmax. Then, in a second loop,
iterate over the successor[] array. If successor[A] = B && lp_valid[A]
&& lp_valid[B], then check whether A.xmax = B.xmin; if so, then
complain if predecessor[B] is already set, else set predecessor[B] =
A. Then, in the third loop, iterate over the predecessor array just as
you're doing now. Then it's clear that we do the lp_valid checks
exactly once for every offset that might need them, and in order. And
it's also clear that the predecessor-based checks can never happen
unless the lp_valid checks passed for both of the offsets involved.
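
In skeleton form that might look like this (a sketch, not patch text; check_lp() here returns whether the line pointer passed validation, and successor_of(), xmax_of() and xmin_of() are stand-ins for the redirect/t_ctid and tuple-header lookups):

    OffsetNumber successor[MaxOffsetNumber + 1] = {0};
    OffsetNumber predecessor[MaxOffsetNumber + 1] = {0};
    bool        lp_valid[MaxOffsetNumber + 1] = {false};
    OffsetNumber off;

    /* first loop: validate each line pointer, record raw links */
    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
    {
        if (!check_lp(&ctx, off))
            continue;
        lp_valid[off] = true;
        successor[off] = successor_of(off);     /* redirect target or t_ctid */
    }

    /* second loop: keep only links whose xmax matches the successor's xmin */
    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
    {
        OffsetNumber next = successor[off];

        if (next == InvalidOffsetNumber || !lp_valid[off] || !lp_valid[next])
            continue;
        if (!TransactionIdEquals(xmax_of(off), xmin_of(next)))
            continue;
        if (predecessor[next] != InvalidOffsetNumber)
            report_corruption(&ctx,
                              psprintf("updated version at offset %u is also the updated version of tuple at offset %u",
                                       (unsigned) next,
                                       (unsigned) predecessor[next]));
        else
            predecessor[next] = off;
    }

    /* third loop: run the HOT-chain checks over predecessor[] */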

> Done, reformatted using pg_indent.

Thanks, but the new check_lp() function's declaration is not formatted
according to pgindent guidelines. It's not enough to fix the problems
once, you have to avoid reintroducing them.

>> +            if (!TransactionIdIsValid(curr_xmax) &&
>> +                HeapTupleHeaderIsHotUpdated(curr_htup))
>> +            {
>> +                report_corruption(&ctx,
>> +                            psprintf("Current tuple at offset %u is HOT but is last tuple in the HOT chain.",
>> +                            (unsigned) ctx.offnum));
>> +            }
>>
>> This check has nothing to do with the predecessor[] array, so it seems
>> like it belongs in check_tuple() rather than here. Also, the message
>> is rather confused, because the test is checking whether the tuple has
>> been HOT-updated, while the message is talking about whether the tuple
>> was *itself* created by a HOT update. Also, when we're dealing with
>> corruption, statements like "is last tuple in the HOT chain" are
>> pretty ambiguous. Also, isn't this an issue for both HOT-updated
>> tuples and also just regular updated tuples? i.e. maybe what we should
>> be complaining about here is something like "tuple has been updated,
>> but xmax is 0" and then make the test check exactly that.
>
> Moved to check_tuple_header(). This should be applicable to both HOT and normal updates, but even the last tuple in a normal update chain is HEAP_UPDATED, so I am not sure how we can apply this check to a normal update.
 

Oh, yeah.  You're right. I was thinking that HEAP_UPDATED was like
HEAP_HOT_UPDATED, but it's not: HEAP_UPDATED gets set on the new
tuple, while HEAP_HOT_UPDATED gets set on the old tuple.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Wed, Sep 7, 2022 at 2:49 AM Robert Haas <robertmhaas@gmail.com> wrote:

But here's one random idea: add a successor[] array and an lp_valid[]
array. In the first loop, set lp_valid[offset] = true if it passes the
check_lp() checks, and set successor[A] = B if A redirects to B or has
a CTID link to B, without matching xmin/xmax. Then, in a second loop,
iterate over the successor[] array. If successor[A] = B && lp_valid[A]
&& lp_valid[B], then check whether A.xmax = B.xmin; if so, then
complain if predecessor[B] is already set, else set predecessor[B] =
A. Then, in the third loop, iterate over the predecessor array just as
you're doing now. Then it's clear that we do the lp_valid checks
exactly once for every offset that might need them, and in order. And
it's also clear that the predecessor-based checks can never happen
unless the lp_valid checks passed for both of the offsets involved.



The approach of introducing a successor array is good, but I see one overhead with having both successor and predecessor arrays: we will traverse each offset on the page three times (one more than the original loop over offsets), and for each offset we have to retrieve the ItemId (PageGetItemId) and the item itself (PageGetItem). These are all macros, so the overhead is small, but it is still there.
How about having a new array (initialised with LP_NOT_CHECKED) of enum LPStatus as below:

typedef enum LPStatus
{
    LP_NOT_CHECKED,
    LP_VALID,
    LP_NOT_VALID
} LPStatus;

and validating and setting the proper status in three places:
1) while validating a redirected tuple,
2) while populating the predecessor array, and
3) at the original place of the "sanity check".


something like:

    if (lpStatus[rdoffnum] == LP_NOT_CHECKED)
    {
        ctx.offnum = rdoffnum;
        if (!check_lp(&ctx, ItemIdGetLength(rditem), ItemIdGetOffset(rditem)))
        {
            lpStatus[rdoffnum] = LP_NOT_VALID;
            continue;
        }
        lpStatus[rdoffnum] = LP_VALID;
    }
    else if (lpStatus[rdoffnum] == LP_NOT_VALID)
        continue;



Thoughts?
--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Fri, Sep 9, 2022 at 10:00 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> The approach of introducing a successor array is good, but I see one overhead with having both successor and predecessor arrays: we will traverse each offset on the page three times (one more than the original loop over offsets), and for each offset we have to retrieve the ItemId (PageGetItemId) and the item itself (PageGetItem). These are all macros, so the overhead is small, but it is still there.
> How about having a new array (initialised with LP_NOT_CHECKED) of enum LPStatus as below:
>
> typedef enum LPStatus
> {
>     LP_NOT_CHECKED,
>     LP_VALID,
>     LP_NOT_VALID
> } LPStatus;
>
> and validating and setting the proper status in three places:
> 1) while validating a redirected tuple,
> 2) while populating the predecessor array, and
> 3) at the original place of the "sanity check".

Well, having to duplicate the logic in three places doesn't seem all
that clean to me. Admittedly, I haven't tried implementing my
proposal, so maybe that doesn't come out very clean either. I don't
know. But I think having the code be clearly correct here is the most
important thing, not shaving a few CPU cycles here or there. It's not
even clear to me that your way would be cheaper, because an "if"
statement is certainly not free, and in fact is probably more
expensive than an extra call to PageGetItem() or PageGetItemId().
Branches are usually more expensive than math. But actually I doubt
that it matters much either way. I think the way to figure out whether
we have a performance problem is to write the code in the
stylistically best way and then test it. There may be no problem at
all, and if there is a problem, it may not be where we think it will
be.

In short, let's apply Knuth's optimization principle.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Wed, Sep 7, 2022 at 2:49 AM Robert Haas <robertmhaas@gmail.com> wrote:

But here's one random idea: add a successor[] array and an lp_valid[]
array. In the first loop, set lp_valid[offset] = true if it passes the
check_lp() checks, and set successor[A] = B if A redirects to B or has
a CTID link to B, without matching xmin/xmax. Then, in a second loop,
iterate over the successor[] array. If successor[A] = B && lp_valid[A]
&& lp_valid[B], then check whether A.xmax = B.xmin; if so, then
complain if predecessor[B] is already set, else set predecessor[B] =
A. Then, in the third loop, iterate over the predecessor array just as
you're doing now. Then it's clear that we do the lp_valid checks
exactly once for every offset that might need them, and in order. And
it's also clear that the predecessor-based checks can never happen
unless the lp_valid checks passed for both of the offsets involved.


OK, I have introduced a new approach that first constructs a successor array and then loops over it to construct the predecessor array.

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Wed, Sep 7, 2022 at 12:11 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think the check should be written like this:

!TransactionIdEquals(pred_xmin, curr_xmin) && !TransactionIdDidCommit(pred_xmin)

The TransactionIdEquals check should be done first for the reason you
state: it's cheaper.

I think that we shouldn't be using TransactionIdDidAbort() at all,
because it can sometimes return false even when the transaction
actually did abort. See test_lockmode_for_conflict() and
TransactionIdIsInProgress() for examples of logic that copes with
this. What's happening here is that TransactionIdDidAbort doesn't ever
get called if the system crashes while a transaction is running. So we
can use TransactionIdDidAbort() only as an optimization: if it returns
true, the transaction is definitely aborted, but if it returns false,
we have to check whether it's still running. If not, it aborted
anyway.

TransactionIdDidCommit() does not have the same issue. A transaction
can abort without updating CLOG if the system crashes, but it can
never commit without updating CLOG. If the transaction didn't commit,
then it is either aborted or still in progress (and we don't care
which, because neither is an error here).

As to whether the existing formulation of the test has an error
condition, you're generally right that we should test
TransactionIdIsInProgress() before TransactionIdDidCommit/Abort,
because during commit or abort, we first set the status in CLOG -
which is queried by TransactionIdDidCommit/Abort - and only afterward
update the procarray - which is queried by TransactionIdIsInProgress.
So normally TransactionIdIsInProgress should be checked first, and
TransactionIdDidCommit/Abort should only be checked if it returns
false, at which point we know that the return values of the latter
calls can't ever change. Possibly there is an argument for including
the TransactionIdInProgress check here too:

!TransactionIdEquals(pred_xmin, curr_xmin) &&
(TransactionIdIsInProgress(pred_xmin) ||
!TransactionIdDidCommit(pred_xmin))

...but I don't think it could change the answer. Most places that
check TransactionIdIsInProgress() first are concerned with MVCC
semantics, and here we are not. I think the only effects of including
or excluding the TransactionIdIsInProgress() test are (1) performance,
in that searching the procarray might avoid expense if it's cheaper
than searching clog, or add expense if the reverse is true and (2)
slightly changing the time at which we're first able to detect this
form of corruption. I am inclined to prefer the simpler form of the
test without TransactionIdIsInProgress() unless someone can say why we
shouldn't go that route.

Done, updated in the v3 patch.


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi Himanshu,

> Done, updated in the v3 patch.

Thanks for the updated patch.

Here is v4 with fixed compiler warnings and some minor tweaks from me.

I didn't put too much thought into the algorithm but I already see
something strange. At verify_heapam.c:553 you declared curr_xmax and
next_xmin. However the variables are not used/initialized until you
do:

```
            if (lp_valid[nextoffnum] && lp_valid[ctx.offnum] &&
                TransactionIdIsValid(curr_xmax) &&
                TransactionIdEquals(curr_xmax, next_xmin)) {
/* ... */
```

In v4 I elected to initialize both curr_xmax and next_xmin with InvalidTransactionId for safety and in order to silence the compiler, but still there is no way this condition can succeed.

Please make sure there is no logic missing.
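
In other words, the declarations are dead weight until assignments along these lines are restored before the test (a sketch; the getter calls are my assumption based on the earlier patch versions):

```
TransactionId curr_xmax = InvalidTransactionId;
TransactionId next_xmin = InvalidTransactionId;

/* ... */

/* without these, the condition above can never be true */
curr_xmax = HeapTupleHeaderGetUpdateXid(curr_htup);
next_xmin = HeapTupleHeaderGetXmin(next_htup);
```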

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Mon, Sep 19, 2022 at 8:27 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
Hi Himanshu,

> Done, updated in the v3 patch.

Thanks for the updated patch.

Here is v4 with fixed compiler warnings and some minor tweaks from me.

I didn't put too much thought into the algorithm but I already see
something strange. At verify_heapam.c:553 you declared curr_xmax and
next_xmin. However the variables are not used/initialized until you
do:

```
            if (lp_valid[nextoffnum] && lp_valid[ctx.offnum] &&
                TransactionIdIsValid(curr_xmax) &&
                TransactionIdEquals(curr_xmax, next_xmin)) {
/* ... */
```

In v4 I elected to initialize both curr_xmax and next_xmin with InvalidTransactionId for safety and in order to silence the compiler, but still there is no way this condition can succeed.

Please make sure there is no logic missing.


Hi Aleksander,

Thanks for sharing the feedback,
My mistake, sorry about that: I was trying to merge two if conditions and forgot to move the initialization of xmin and xmax. I now think it is better to use a nested if, with the inner condition testing whether xmax and xmin match. That way we retrieve and populate xmin and xmax only when the inner condition actually needs them.
I have changed this in the attached patch.
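
In outline (a sketch; curr_htup/next_htup stand for the tuple headers fetched for ctx.offnum and nextoffnum):

    if (lp_valid[ctx.offnum] && lp_valid[nextoffnum])
    {
        TransactionId curr_xmax = HeapTupleHeaderGetUpdateXid(curr_htup);
        TransactionId next_xmin = HeapTupleHeaderGetXmin(next_htup);

        /* xmax/xmin are fetched only when the inner test can actually run */
        if (TransactionIdIsValid(curr_xmax) &&
            TransactionIdEquals(curr_xmax, next_xmin))
        {
            /* record ctx.offnum as the predecessor of nextoffnum */
        }
    }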
 
--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi Himanshu,

> I have changed this in the attached patch.

If it's not too much trouble could you please base your changes on v4
that I submitted? I put some effort into writing a proper commit
message, editing the comments, etc. The easiest way of doing this is
using `git am` and `git format-patch`.

-- 
Best regards,
Aleksander Alekseev



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Mon, Sep 19, 2022 at 10:04 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
Hi Himanshu,

> I have changed this in the attached patch.

If it's not too much trouble could you please base your changes on v4
that I submitted? I put some effort into writing a proper commit
message, editing the comments, etc. The easiest way of doing this is
using `git am` and `git format-patch`.

Please find it attached.

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Tue, Sep 20, 2022 at 5:00 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Please find it attached.

This patch still has no test cases. Just as we have test cases for the
existing corruption checks, we should have test cases for these new
corruption checks, showing cases where they actually fire.

I think I would be inclined to set lp_valid[x] = true in both the
redirected and non-redirected case, and then have the very first thing
that the second loop does be if (nextoffnum == 0 ||
!lp_valid[ctx.offnum]) continue. I think that would be more clear
about the intent to ignore line pointers that failed validation. Also,
if you did it that way, then you could catch the case of a redirected
line pointer pointing to another redirected line pointer, which is a
corruption condition that the current code does not appear to check.
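
For instance (a sketch; lp and next_lp are placeholder names for the two line pointers involved):

    /* a redirected line pointer must not point at another redirect */
    if (ItemIdIsRedirected(lp) && ItemIdIsRedirected(next_lp))
        report_corruption(&ctx,
                          psprintf("redirected line pointer points to another redirected line pointer at offset %u",
                                   (unsigned) nextoffnum));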

+            /*
+             * Validation via the predecessor array. 1) If the predecessor's
+             * xmin is aborted or in progress, the current tuples xmin should
+             * be aborted or in progress respectively. Also both xmin's must
+             * be equal. 2) If the predecessor's xmin is not frozen, then
+             * current tuple's shouldn't be either. 3) If the predecessor's
+             * xmin is equal to the current tuple's xmin, the current tuple's
+             * cmin should be greater than the predecessor's cmin. 4) If the
+             * current tuple is not HOT then its predecessor's tuple must not
+             * be HEAP_HOT_UPDATED. 5) If the current tuple is HOT then its
+             * predecessor's tuple must be HEAP_HOT_UPDATED.
+             */

This comment needs to be split up into pieces and the pieces need to
be moved closer to the tests to which they correspond.

+                                  psprintf("unfrozen tuple was updated to produce a tuple at offset %u which is not frozen",

Shouldn't this say "which is frozen"?

+             * Not a corruption if current tuple is updated/deleted by a
+             * different transaction, means t_cid will point to cmax (that is
+             * command id of deleting transaction) and cid of predecessor not
+             * necessarily will be smaller than cid of current tuple. t_cid

I think that the next person who reads this code is likely to
understand that the CIDs of different transactions are numerically
unrelated. What's less obvious is that if the XID is the same, the
newer update must have a higher CID.

+             * can hold combo command id but we are not worrying here since
+             * combo command id of the next updated tuple (if present) must be
+             * greater than combo command id of the current tuple. So here we
+             * are not checking HEAP_COMBOCID flag and simply doing t_cid
+             * comparison.

I disapprove of ignoring the HEAP_COMBOCID flag. Emitting a message
claiming that the CID has a certain value when that's actually a combo
CID is misleading, so at least a different message wording is needed
in such cases. But it's also not clear to me that the newer update has
to have a higher combo CID, because combo CIDs can be reused. If you
have multiple cursors open in the same transaction, the updates can be
interleaved, and it seems to me that it might be possible for an older
CID to have created a certain combo CID after a newer CID, and then
both cursors could update the same page in succession and end up with
combo CIDs out of numerical order. Unless somebody can make a
convincing argument that this isn't possible, I think we should just
skip this check for cases where either tuple has a combo CID.
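
In code form the skip might look like this (a sketch; has_combocid is a name I am introducing here, the other names follow the patch):

    bool        has_combocid;

    /* raw t_cid values are not comparable once combo CIDs are involved */
    has_combocid = (pred_htup->t_infomask & HEAP_COMBOCID) != 0 ||
                   (curr_htup->t_infomask & HEAP_COMBOCID) != 0;
    if (!has_combocid &&
        TransactionIdEquals(pred_xmin, curr_xmin) &&
        pred_cmin >= curr_cmin)
        report_corruption(&ctx,
                          psprintf("tuple with cmin %u was updated to produce a tuple with cmin %u",
                                   (unsigned) pred_cmin,
                                   (unsigned) curr_cmin));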

+            if (TransactionIdEquals(pred_xmin, curr_xmin) &&
+                (TransactionIdEquals(curr_xmin, curr_xmax) ||
+                 !TransactionIdIsValid(curr_xmax)) && pred_cmin >= curr_cmin)

I don't understand the reason for the middle part of this condition --
TransactionIdEquals(curr_xmin, curr_xmax) ||
!TransactionIdIsValid(curr_xmax). I suppose the comment is meant to
explain this, but I still don't get it. If a tuple with XMIN 12345
CMIN 2 is updated to produce a tuple with XMIN 12345 CMIN 1, that's
corruption, regardless of what the XMAX of the second tuple may happen
to be.

+            if (HeapTupleHeaderIsHeapOnly(curr_htup) &&
+                !HeapTupleHeaderIsHotUpdated(pred_htup))

+            if (!HeapTupleHeaderIsHeapOnly(curr_htup) &&
+                HeapTupleHeaderIsHotUpdated(pred_htup))

I think it would be slightly clearer to write these tests the other
way around i.e. check the previous tuple's state first.

+    if (!TransactionIdIsValid(curr_xmax) &&
HeapTupleHeaderIsHotUpdated(tuphdr))
+    {
+        report_corruption(ctx,
+                          psprintf("tuple has been updated, but xmax is 0"));
+        result = false;
+    }

I guess this message needs to say "tuple has been HOT updated, but
xmax is 0" or something like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Tue, Sep 20, 2022 at 6:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
I disapprove of ignoring the HEAP_COMBOCID flag. Emitting a message
claiming that the CID has a certain value when that's actually a combo
CID is misleading, so at least a different message wording is needed
in such cases. But it's also not clear to me that the newer update has
to have a higher combo CID, because combo CIDs can be reused. If you
have multiple cursors open in the same transaction, the updates can be
interleaved, and it seems to me that it might be possible for an older
CID to have created a certain combo CID after a newer CID, and then
both cursors could update the same page in succession and end up with
combo CIDs out of numerical order. Unless somebody can make a
convincing argument that this isn't possible, I think we should just
skip this check for cases where either tuple has a combo CID.

Here our objective is to validate that if both the predecessor's xmin and the current tuple's xmin are the same, then the predecessor's cmin must be less than the current tuple's cmin. When both xmins are the same, I think the predecessor's t_cid will always hold a combo CID.
So at least one of the two tuples will always have a combo CID, and skipping this check whenever "either tuple has a combo CID" would make the condition always evaluate to false.
 
+            if (TransactionIdEquals(pred_xmin, curr_xmin) &&
+                (TransactionIdEquals(curr_xmin, curr_xmax) ||
+                 !TransactionIdIsValid(curr_xmax)) && pred_cmin >= curr_cmin)

I don't understand the reason for the middle part of this condition --
TransactionIdEquals(curr_xmin, curr_xmax) ||
!TransactionIdIsValid(curr_xmax). I suppose the comment is meant to
explain this, but I still don't get it. If a tuple with XMIN 12345
CMIN 2 is updated to produce a tuple with XMIN 12345 CMIN 1, that's
corruption, regardless of what the XMAX of the second tuple may happen
to be.

 tuple | t_xmin | t_xmax | t_cid | t_ctid |              tuple_data_split               | heap_tuple_infomask_flags
-------+--------+--------+-------+--------+---------------------------------------------+---------------------------
     1 |    971 |    971 |     0 | (0,3)  | {"\\x1774657374312020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED}",{})
     2 |    971 |      0 |     1 | (0,2)  | {"\\x1774657374322020202020","\\x02000000"} | ("{HEAP_HASVARWIDTH,HEAP_XMAX_INVALID}",{})
     3 |    971 |    971 |     1 | (0,4)  | {"\\x1774657374322020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}",{})
     4 |    971 |    972 |     0 | (0,5)  | {"\\x1774657374332020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}",{})
     5 |    972 |      0 |     0 | (0,5)  | {"\\x1774657374342020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}",{})

In the above case, tuples 1->3->4 were inserted and updated by xid 971, and tuple 4 was next updated by xid 972. Here the t_cid of tuple 4 is 0 whereas its predecessor's t_cid is 1, because tuple 4's t_cid holds the command ID of the deleting transaction (cmax). That is why we need to check the tuple's xmax.

Please correct me if I am missing anything here?
--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Sat, Sep 24, 2022 at 8:45 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Here our objective is to validate that if both the predecessor's xmin and the current tuple's xmin are the same, then the predecessor's cmin must be less than the current tuple's cmin. When both xmins are the same, I think the predecessor's t_cid will always hold a combo CID.
> So at least one of the two tuples will always have a combo CID, and skipping this check whenever "either tuple has a combo CID" would make the condition always evaluate to false.

Fair point. I think maybe we should just remove the CID validation
altogether. I thought initially that it would be possible to have a
newer update with a numerically lower combo CID, but after some
experimentation I don't see a way to do it. However, it also doesn't
seem very useful to me to check that the combo CIDs are in ascending
order. I mean, even if that's not supposed to happen and does anyway,
there aren't really any enduring consequences, because command IDs are
ignored completely outside of the transaction that performed the
operation originally. So even if the combo CIDs were set to completely
random values, is that really corruption? At most it messes things up
for the duration of one transaction. And if we just have plain CIDs
rather than combo CIDs, the same thing is true: they could be totally
messed up and it wouldn't really matter beyond the lifetime of that
one transaction.

Also, it would be a lot more tempting to check this if we could check
it in all cases, but we can't. If a tuple is inserted in transaction
T1 and then updated twice in transaction T2, we'll have only one combo
CID and nothing to compare it against, nor any way to decode what CMIN
and CMAX it originally represented. And this is probably a pretty
common type of case.

>> +            if (TransactionIdEquals(pred_xmin, curr_xmin) &&
>> +                (TransactionIdEquals(curr_xmin, curr_xmax) ||
>> +                 !TransactionIdIsValid(curr_xmax)) && pred_cmin >= curr_cmin)
>>
>> I don't understand the reason for the middle part of this condition --
>> TransactionIdEquals(curr_xmin, curr_xmax) ||
>> !TransactionIdIsValid(curr_xmax). I suppose the comment is meant to
>> explain this, but I still don't get it. If a tuple with XMIN 12345
>> CMIN 2 is updated to produce a tuple with XMIN 12345 CMIN 1, that's
>> corruption, regardless of what the XMAX of the second tuple may happen
>> to be.
>
> tuple | t_xmin | t_xmax | t_cid | t_ctid |              tuple_data_split               | heap_tuple_infomask_flags
> -------+--------+--------+-------+--------+---------------------------------------------+---------------------------
>      1 |    971 |    971 |     0 | (0,3)  | {"\\x1774657374312020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED}",{})
>      2 |    971 |      0 |     1 | (0,2)  | {"\\x1774657374322020202020","\\x02000000"} | ("{HEAP_HASVARWIDTH,HEAP_XMAX_INVALID}",{})
>      3 |    971 |    971 |     1 | (0,4)  | {"\\x1774657374322020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}",{})
>      4 |    971 |    972 |     0 | (0,5)  | {"\\x1774657374332020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}",{})
>      5 |    972 |      0 |     0 | (0,5)  | {"\\x1774657374342020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}",{})
>
> In the above case, tuples 1->3->4 were inserted and updated by xid 971, and tuple 4 was next updated by xid 972. Here the t_cid of tuple 4 is 0 whereas its predecessor's t_cid is 1, because tuple 4's t_cid holds the command ID of the deleting transaction (cmax). That is why we need to check the tuple's xmax.
>
> Please correct me if I am missing anything here?

Hmm, I see, so basically you're trying to check whether the CID field
contains a CMIN as opposed to a CMAX. But I'm not sure this test is
entirely reliable, because heap_prepare/execute_freeze_tuple() can set
a tuple's xmax to InvalidTransactionId even after it's had some other
value, and that won't do anything to the contents of the CID field.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Tue, Sep 27, 2022 at 1:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Sep 24, 2022 at 8:45 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Here our objective is to validate that if both the predecessor's xmin and the current tuple's xmin are the same, then the predecessor's cmin must be less than the current tuple's cmin. When both xmins are the same, I think the predecessor's t_cid will always hold a combo CID.
> So at least one of the two tuples will always have a combo CID, and skipping this check whenever "either tuple has a combo CID" would make the condition always evaluate to false.

Fair point. I think maybe we should just remove the CID validation
altogether. I thought initially that it would be possible to have a
newer update with a numerically lower combo CID, but after some
experimentation I don't see a way to do it. However, it also doesn't
seem very useful to me to check that the combo CIDs are in ascending
order. I mean, even if that's not supposed to happen and does anyway,
there aren't really any enduring consequences, because command IDs are
ignored completely outside of the transaction that performed the
operation originally. So even if the combo CIDs were set to completely
random values, is that really corruption? At most it messes things up
for the duration of one transaction. And if we just have plain CIDs
rather than combo CIDs, the same thing is true: they could be totally
messed up and it wouldn't really matter beyond the lifetime of that
one transaction.

Also, it would be a lot more tempting to check this if we could check
it in all cases, but we can't. If a tuple is inserted in transaction
T1 and then updated twice in transaction T2, we'll have only one combo
CID and nothing to compare it against, nor any way to decode what CMIN
and CMAX it originally represented. And this is probably a pretty
common type of case.

OK, I will remove this entire cmin/CID validation in my next patch.
 
>> +            if (TransactionIdEquals(pred_xmin, curr_xmin) &&
>> +                (TransactionIdEquals(curr_xmin, curr_xmax) ||
>> +                 !TransactionIdIsValid(curr_xmax)) && pred_cmin >= curr_cmin)
>>
>> I don't understand the reason for the middle part of this condition --
>> TransactionIdEquals(curr_xmin, curr_xmax) ||
>> !TransactionIdIsValid(curr_xmax). I suppose the comment is meant to
>> explain this, but I still don't get it. If a tuple with XMIN 12345
>> CMIN 2 is updated to produce a tuple with XMIN 12345 CMIN 1, that's
>> corruption, regardless of what the XMAX of the second tuple may happen
>> to be.
>
> tuple | t_xmin | t_xmax | t_cid | t_ctid |              tuple_data_split               | heap_tuple_infomask_flags
> -------+--------+--------+-------+--------+---------------------------------------------+---------------------------
>      1 |    971 |    971 |     0 | (0,3)  | {"\\x1774657374312020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED}",{})
>      2 |    971 |      0 |     1 | (0,2)  | {"\\x1774657374322020202020","\\x02000000"} | ("{HEAP_HASVARWIDTH,HEAP_XMAX_INVALID}",{})
>      3 |    971 |    971 |     1 | (0,4)  | {"\\x1774657374322020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}",{})
>      4 |    971 |    972 |     0 | (0,5)  | {"\\x1774657374332020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}",{})
>      5 |    972 |      0 |     0 | (0,5)  | {"\\x1774657374342020202020","\\x01000000"} | ("{HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}",{})
>
> In the above case, tuples 1->3->4 were inserted and updated by xid 971, and tuple 4 was next updated by xid 972. Here the t_cid of tuple 4 is 0 whereas its predecessor's t_cid is 1, because tuple 4's t_cid holds the command ID of the deleting transaction (cmax). That is why we need to check the tuple's xmax.
>
> Please correct me if I am missing anything here?

Hmm, I see, so basically you're trying to check whether the CID field
contains a CMIN as opposed to a CMAX. But I'm not sure this test is
entirely reliable, because heap_prepare/execute_freeze_tuple() can set
a tuple's xmax to InvalidTransactionId even after it's had some other
value, and that won't do anything to the contents of the CID field.
 
OK, got it. Since we are removing this cmin/CID validation, we don't need any change here.

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:

On Tue, Sep 20, 2022 at 6:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 20, 2022 at 5:00 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Please find it attached.

This patch still has no test cases. Just as we have test cases for the
existing corruption checks, we should have test cases for these new
corruption checks, showing cases where they actually fire.

Test cases are now part of this v6 patch.
 
I think I would be inclined to set lp_valid[x] = true in both the
redirected and non-redirected case, and then have the very first thing
that the second loop does be if (nextoffnum == 0 ||
!lp_valid[ctx.offnum]) continue. I think that would be more clear
about the intent to ignore line pointers that failed validation. Also,
if you did it that way, then you could catch the case of a redirected
line pointer pointing to another redirected line pointer, which is a
corruption condition that the current code does not appear to check.

Yes, it's a good idea to do this additional validation with a redirected line pointer. Done. 
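
For reference, the resulting structure looks roughly like this, using the
array names from the patch (a simplified sketch, not the exact diff):

    /* First loop: record links; mark items that pass the per-item checks. */
    if (ItemIdIsRedirected(ctx.itemid))
    {
        successor[ctx.offnum] = rdoffnum;
        lp_valid[ctx.offnum] = true;
        continue;
    }
    /* ... basic line pointer and tuple header sanity checks ... */
    lp_valid[ctx.offnum] = true;

    /* Second loop: skip entries where either end failed validation. */
    if (nextoffnum == 0 || !lp_valid[ctx.offnum] || !lp_valid[nextoffnum])
        continue;
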
+            /*
+             * Validation via the predecessor array. 1) If the predecessor's
+             * xmin is aborted or in progress, the current tuples xmin should
+             * be aborted or in progress respectively. Also both xmin's must
+             * be equal. 2) If the predecessor's xmin is not frozen, then
+             * current tuple's shouldn't be either. 3) If the predecessor's
+             * xmin is equal to the current tuple's xmin, the current tuple's
+             * cmin should be greater than the predecessor's cmin. 4) If the
+             * current tuple is not HOT then its predecessor's tuple must not
+             * be HEAP_HOT_UPDATED. 5) If the current tuple is HOT then its
+             * predecessor's tuple must be HEAP_HOT_UPDATED.
+             */

This comment needs to be split up into pieces and the pieces need to
be moved closer to the tests to which they correspond.

Done.
 
+                                  psprintf("unfrozen tuple was updated to produce a tuple at offset %u which is not frozen",

Shouldn't this say "which is frozen"?

Done.
 
+             * Not a corruption if current tuple is updated/deleted by a
+             * different transaction, means t_cid will point to cmax (that is
+             * command id of deleting transaction) and cid of predecessor not
+             * necessarily will be smaller than cid of current tuple. t_cid

I think that the next person who reads this code is likely to
understand that the CIDs of different transactions are numerically
unrelated. What's less obvious is that if the XID is the same, the
newer update must have a higher CID.

+             * can hold combo command id but we are not worrying here since
+             * combo command id of the next updated tuple (if present) must be
+             * greater than combo command id of the current tuple. So here we
+             * are not checking HEAP_COMBOCID flag and simply doing t_cid
+             * comparison.

I disapprove of ignoring the HEAP_COMBOCID flag. Emitting a message
claiming that the CID has a certain value when that's actually a combo
CID is misleading, so at least a different message wording is needed
in such cases. But it's also not clear to me that the newer update has
to have a higher combo CID, because combo CIDs can be reused. If you
have multiple cursors open in the same transaction, the updates can be
interleaved, and it seems to me that it might be possible for an older
CID to have created a certain combo CID after a newer CID, and then
both cursors could update the same page in succession and end up with
combo CIDs out of numerical order. Unless somebody can make a
convincing argument that this isn't possible, I think we should just
skip this check for cases where either tuple has a combo CID.

+            if (TransactionIdEquals(pred_xmin, curr_xmin) &&
+                (TransactionIdEquals(curr_xmin, curr_xmax) ||
+                 !TransactionIdIsValid(curr_xmax)) && pred_cmin >= curr_cmin)

I don't understand the reason for the middle part of this condition --
TransactionIdEquals(curr_xmin, curr_xmax) ||
!TransactionIdIsValid(curr_xmax). I suppose the comment is meant to
explain this, but I still don't get it. If a tuple with XMIN 12345
CMIN 2 is updated to produce a tuple with XMIN 12345 CMIN 1, that's
corruption, regardless of what the XMAX of the second tuple may happen
to be.

As discussed above, I am removing this check altogether.
 
+            if (HeapTupleHeaderIsHeapOnly(curr_htup) &&
+                !HeapTupleHeaderIsHotUpdated(pred_htup))

+            if (!HeapTupleHeaderIsHeapOnly(curr_htup) &&
+                HeapTupleHeaderIsHotUpdated(pred_htup))

I think it would be slightly clearer to write these tests the other
way around i.e. check the previous tuple's state first.

Done.
 
+    if (!TransactionIdIsValid(curr_xmax) &&
HeapTupleHeaderIsHotUpdated(tuphdr))
+    {
+        report_corruption(ctx,
+                          psprintf("tuple has been updated, but xmax is 0"));
+        result = false;
+    }

I guess this message needs to say "tuple has been HOT updated, but
xmax is 0" or something like that.

Done.


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi Himanshu,

> Test cases are now part of this v6 patch.

I believe the patch is in pretty good shape now. I'm going to change
its status to "Ready for Committer" soon unless there are any objections.

-- 
Best regards,
Aleksander Alekseev



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

To start with: I think this is an extremely helpful and important
feature. Both for checking production systems and for finding problems during
development.


> From 08fe01f5073c0a850541265494bb4a875bec7d3f Mon Sep 17 00:00:00 2001
> From: Himanshu Upadhyaya <himanshu.upadhyaya@enterprisedb.com>
> Date: Fri, 30 Sep 2022 17:44:56 +0530
> Subject: [PATCH v6] Implement HOT chain validation in verify_heapam()
> 
> Himanshu Upadhyaya, reviewed by Robert Haas, Aleksander Alekseev
> 
> Discussion: https://postgr.es/m/CAPF61jBBR2-iE-EmN_9v0hcQEfyz_17e5Lbb0%2Bu2%3D9ukA9sWmQ%40mail.gmail.com
> ---
>  contrib/amcheck/verify_heapam.c           | 207 ++++++++++++++++++++++
>  src/bin/pg_amcheck/t/004_verify_heapam.pl | 192 ++++++++++++++++++--
>  2 files changed, 388 insertions(+), 11 deletions(-)
> 
> diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
> index c875f3e5a2..007f7b2f37 100644
> --- a/contrib/amcheck/verify_heapam.c
> +++ b/contrib/amcheck/verify_heapam.c
> @@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS)
>      for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
>      {
>          OffsetNumber maxoff;
> +        OffsetNumber predecessor[MaxOffsetNumber] = {0};
> +        OffsetNumber successor[MaxOffsetNumber] = {0};
> +        bool        lp_valid[MaxOffsetNumber] = {false};
>  
>          CHECK_FOR_INTERRUPTS();
>  
> @@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS)
>          for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
>               ctx.offnum = OffsetNumberNext(ctx.offnum))
>          {
> +            OffsetNumber nextoffnum;
> +
>              ctx.itemid = PageGetItemId(ctx.page, ctx.offnum);
>  
>              /* Skip over unused/dead line pointers */
> @@ -469,6 +474,13 @@ verify_heapam(PG_FUNCTION_ARGS)
>                      report_corruption(&ctx,
>                                        psprintf("line pointer redirection to unused item at offset %u",
>                                                 (unsigned) rdoffnum));
> +
> +                /*
> +                 * make entry in successor array, redirected tuple will be
> +                 * validated at the time when we loop over successor array
> +                 */
> +                successor[ctx.offnum] = rdoffnum;
> +                lp_valid[ctx.offnum] = true;
>                  continue;
>              }
>  
> @@ -504,9 +516,197 @@ verify_heapam(PG_FUNCTION_ARGS)
>              /* It should be safe to examine the tuple's header, at least */
>              ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
>              ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
> +            lp_valid[ctx.offnum] = true;
>  
>              /* Ok, ready to check this next tuple */
>              check_tuple(&ctx);
> +
> +            /*
> +             * Add the data to the successor array if next updated tuple is in
> +             * the same page. It will be used later to generate the
> +             * predecessor array.
> +             *
> +             * We need to access the tuple's header to populate the
> +             * predecessor array. However the tuple is not necessarily sanity
> +             * checked yet so delaying construction of predecessor array until
> +             * all tuples are sanity checked.
> +             */
> +            nextoffnum = ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid);
> +            if (ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid) == ctx.blkno &&
> +                nextoffnum != ctx.offnum)
> +            {
> +                successor[ctx.offnum] = nextoffnum;
> +            }

I don't really understand this logic - why can't we populate the predecessor
array, if we can construct a successor entry?


> +        }
> +
> +        /*
> +         * Loop over offset and populate predecessor array from all entries
> +         * that are present in successor array.
> +         */
> +        ctx.attnum = -1;
> +        for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
> +             ctx.offnum = OffsetNumberNext(ctx.offnum))
> +        {
> +            ItemId        curr_lp;
> +            ItemId        next_lp;
> +            HeapTupleHeader curr_htup;
> +            HeapTupleHeader next_htup;
> +            TransactionId curr_xmax;
> +            TransactionId next_xmin;
> +
> +            OffsetNumber nextoffnum = successor[ctx.offnum];
> +
> +            curr_lp = PageGetItemId(ctx.page, ctx.offnum);

Why do we get the item when nextoffnum is 0?


> +            if (nextoffnum == 0 || !lp_valid[ctx.offnum] || !lp_valid[nextoffnum])
> +            {
> +                /*
> +                 * This is either the last updated tuple in the chain or a
> +                 * corruption raised for this tuple.
> +                 */

"or a corruption raised" isn't quite right grammatically.


> +                continue;
> +            }
> +            if (ItemIdIsRedirected(curr_lp))
> +            {
> +                next_lp = PageGetItemId(ctx.page, nextoffnum);
> +                if (ItemIdIsRedirected(next_lp))
> +                {
> +                    report_corruption(&ctx,
> +                                      psprintf("redirected line pointer pointing to another redirected line pointer at offset %u",
> +                                               (unsigned) nextoffnum));
> +                    continue;
> +                }
> +                next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
> +                if (!HeapTupleHeaderIsHeapOnly(next_htup))
> +                {
> +                    report_corruption(&ctx,
> +                                      psprintf("redirected tuple at line pointer offset %u is not heap only tuple",
> +                                               (unsigned) nextoffnum));
> +                }
> +                if ((next_htup->t_infomask & HEAP_UPDATED) == 0)
> +                {
> +                    report_corruption(&ctx,
> +                                      psprintf("redirected tuple at line pointer offset %u is not heap updated tuple",
> +                                               (unsigned) nextoffnum));
> +                }
> +                continue;
> +            }
> +
> +            /*
> +             * Add a line pointer offset to the predecessor array if xmax is
> +             * matching with xmin of next tuple (reaching via its t_ctid).
> +             * Prior to PostgreSQL 9.4, we actually changed the xmin to
> +             * FrozenTransactionId

I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
of getting that right seems low and I don't see us gaining much by even trying.


> so we must add offset to predecessor
> +             * array(irrespective of xmax-xmin matching) if updated tuple xmin
> +             * is frozen, so that we can later do validation related to frozen
> +             * xmin. Raise corruption if we have two tuples having the same
> +             * predecessor.
> +             * We add the offset to the predecessor array irrespective of the
> +             * transaction (t_xmin) status. We will do validation related to
> +             * the transaction status (and also all other validations) when we
> +             * loop over the predecessor array.
> +             */
> +            curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
> +            curr_xmax = HeapTupleHeaderGetUpdateXid(curr_htup);
> +            next_lp = PageGetItemId(ctx.page, nextoffnum);
> +            next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
> +            next_xmin = HeapTupleHeaderGetXmin(next_htup);
> +            if (TransactionIdIsValid(curr_xmax) &&
> +                (TransactionIdEquals(curr_xmax, next_xmin) ||
> +                 next_xmin == FrozenTransactionId))
> +            {
> +                if (predecessor[nextoffnum] != 0)
> +                {
> +                    report_corruption(&ctx,
> +                                      psprintf("updated version at offset %u is also the updated version of tuple at offset %u",
> +                                               (unsigned) nextoffnum, (unsigned) predecessor[nextoffnum]));
> +                    continue;

I doubt it is correct to enter this path with next_xmin ==
FrozenTransactionId. This is following a ctid chain that we normally wouldn't
follow, because it doesn't satisfy the t_self->xmax == t_ctid->xmin condition.

I don't immediately see what prevents the frozen tuple being from an entirely
different HOT chain than the two tuples pointing to it.




> +        }
> +
> +        /* Loop over offsets and validate the data in the predecessor array. */
> +        for (OffsetNumber currentoffnum = FirstOffsetNumber; currentoffnum <= maxoff;
> +             currentoffnum = OffsetNumberNext(currentoffnum))
> +        {
> +            HeapTupleHeader pred_htup;
> +            HeapTupleHeader curr_htup;
> +            TransactionId pred_xmin;
> +            TransactionId curr_xmin;
> +            ItemId        pred_lp;
> +            ItemId        curr_lp;
> +
> +            ctx.offnum = predecessor[currentoffnum];
> +            ctx.attnum = -1;
> +
> +            if (ctx.offnum == 0)
> +            {
> +                /*
> +                 * Either the root of the chain or an xmin-aborted tuple from
> +                 * an abandoned portion of the HOT chain.
> +                 */

Hm - couldn't we check that the tuple could conceivably be at the root of a
chain? I.e. isn't HEAP_HOT_UPDATED? Or alternatively has an aborted xmin?


> +                continue;
> +            }
> +
> +            curr_lp = PageGetItemId(ctx.page, currentoffnum);
> +            curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
> +            curr_xmin = HeapTupleHeaderGetXmin(curr_htup);
> +
> +            ctx.itemid = pred_lp = PageGetItemId(ctx.page, ctx.offnum);
> +            pred_htup = (HeapTupleHeader) PageGetItem(ctx.page, pred_lp);
> +            pred_xmin = HeapTupleHeaderGetXmin(pred_htup);
> +
> +            /*
> +             * If the predecessor's xmin is aborted or in progress, the
> +             * current tuples xmin should be aborted or in progress
> +             * respectively. Also both xmin's must be equal.
> +             */
> +            if (!TransactionIdEquals(pred_xmin, curr_xmin) &&
> +                !TransactionIdDidCommit(pred_xmin))
> +            {
> +                report_corruption(&ctx,
> +                                  psprintf("tuple with uncommitted xmin %u was updated to produce a tuple at offset %u with differing xmin %u",
> +                                           (unsigned) pred_xmin, (unsigned) currentoffnum, (unsigned) curr_xmin));

Is this necessarily true? What about a tuple that was inserted in a
subtransaction and then updated in another subtransaction of the same toplevel
transaction?


> +            }
> +
> +            /*
> +             * If the predecessor's xmin is not frozen, then current tuple's
> +             * shouldn't be either.
> +             */
> +            if (pred_xmin != FrozenTransactionId && curr_xmin == FrozenTransactionId)
> +            {
> +                report_corruption(&ctx,
> +                                  psprintf("unfrozen tuple was updated to produce a tuple at offset %u which is frozen",
> +                                           (unsigned) currentoffnum));
> +            }

Can't we have an update chain that is e.g.
xmin 10, xmax 5 -> xmin 5, xmax invalid

and a vacuum cutoff of 7? That'd prevent the first tuple from being removed,
but would allow 5 to be frozen.

I think there were recent patches proposing that we not freeze in that case,
but we'll have done that in the past....


Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Wed, Nov 9, 2022 at 2:08 PM Andres Freund <andres@anarazel.de> wrote:
> To start with: I think this is an extremely helpful and important
> feature. Both for checking production systems and for finding problems during
> development.

+1.

It's painful to get this in, in part because we now have to actually
decide what the rules really are with total precision, including for
all of the tricky edge cases. The exercise of writing this code should
"keep us honest" about whether or not we really know what the
invariants are, which is more than half the battle.

> > +                     /*
> > +                      * Add a line pointer offset to the predecessor array if xmax is
> > +                      * matching with xmin of next tuple (reaching via its t_ctid).
> > +                      * Prior to PostgreSQL 9.4, we actually changed the xmin to
> > +                      * FrozenTransactionId
>
> I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
> of getting that right seems low and I don't see us gaining much by even trying.

This is the kind of comment that I'd usually agree with, but I
disagree in this instance because of special considerations that apply
to amcheck (or should IMV apply, at least). We're living in a world
where we have to assume that the pre-9.4 format can occur in the
field. If we can't get it right in amcheck, what chance do we have
with other new code that tickles the same areas? I think that we need
to support obsolescent heapam representations (both
FrozenTransactionId and xvac) here on general principle.

Besides, why not accept some small chance of getting this wrong? The
worst that can happen is that we'll have a relatively benign bug. If
we get it wrong then it's a short term problem, but also an
opportunity to be less wrong in the future -- including in places
where the consequences of being wrong are much more serious.

> I doubt it is correct to enter this path with next_xmin ==
> FrozenTransactionId. This is following a ctid chain that we normally wouldn't
> follow, because it doesn't satisfy the t_self->xmax == t_ctid->xmin condition.

We should never see FrozenTransactionId in an xmax field (nor should
it be returned by HeapTupleHeaderGetUpdateXid() under any
circumstances). We can "freeze xmax" during VACUUM, but that actually
means setting xmax to InvalidTransactionId (in rare cases it might
mean replacing a Multi with a new Multi). The terminology in this area
is a bit tricky.

Anyway, it follows that we cannot expect "next_xmin ==
FrozenTransactionId", because that would mean that we'd called
HeapTupleHeaderGetUpdateXid() which returned FrozenTransactionId -- an
impossibility. (Maybe we should be checking that it really is an
impossibility by checking the HeapTupleHeaderGetUpdateXid() return
value, but that should be enough.)
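
A minimal sketch of what such a cross-check could look like (with tuphdr
standing in for whatever header pointer is in scope):

    /* The decoded update XID must never be FrozenTransactionId. */
    Assert(!TransactionIdEquals(HeapTupleHeaderGetUpdateXid(tuphdr),
                                FrozenTransactionId));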

> I don't immediately see what prevents the frozen tuple being from an entirely
> different HOT chain than the two tuples pointing to it.

In my view it simply isn't possible for a valid HOT chain to be in
this state in the first place. So by definition it wouldn't be a HOT
chain. That would be a form of corruption, which is something that
would probably be detected by noticing orphaned heap-only tuples
(heap-only tuples not reachable from some root item on the same page,
or some other intermediary heap-only tuple reachable from a root
item).

> Can't we have an update chain that is e.g.
> xmin 10, xmax 5 -> xmin 5, xmax invalid
>
> and a vacuum cutoff of 7? That'd prevent the first tuple from being removed,
> but would allow 5 to be frozen.

I don't see how that can be possible. That is contradictory, and
cannot possibly work, since it supposes a situation where every
possible MVCC snapshot sees the update that generated the
second/successor tuple as committed, while at the same time also
somehow needing the original tuple to stay in place. Surely both
things can never be true at the same time.

I believe you're right that an update chain that looks like this one
is possible. However, I don't think it's possible for
OldestXmin/FreezeLimit to take on a value like that (i.e. a value that
"skewers" the update chain like this, the value 7 from your example).
We ought to be able to rely on an OldestXmin value that can never let
such a situation emerge. Right?

--
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-09 15:03:39 -0800, Peter Geoghegan wrote:
> > > +                     /*
> > > +                      * Add a line pointer offset to the predecessor array if xmax is
> > > +                      * matching with xmin of next tuple (reaching via its t_ctid).
> > > +                      * Prior to PostgreSQL 9.4, we actually changed the xmin to
> > > +                      * FrozenTransactionId
> >
> > I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
> > of getting that right seems low and I don't see us gaining much by even trying.
> 
> This is the kind of comment that I'd usually agree with, but I
> disagree in this instance because of special considerations that apply
> to amcheck (or should IMV apply, at least). We're living in a world
> where we have to assume that the pre-9.4 format can occur in the
> field. If we can't get it right in amcheck, what chance do we have
> with other new code that tickles the same areas? I think that we need
> to support obsolescent heapam representations (both
> FrozenTransactionId and xvac) here on general principle.

To me this is extending the problem into more areas rather than reducing
it. I'd have *zero* confidence in any warnings that amcheck issued that
involved <9.4 special cases.

We've previously discussed adding a pg_class column tracking the PG version that
last scanned the whole relation. We really should get to that one of these
decades :(.


> Besides, why not accept some small chance of getting this wrong? The
> worst that can happen is that we'll have a relatively benign bug. If
> we get it wrong then it's a short term problem, but also an
> opportunity to be less wrong in the future -- including in places
> where the consequences of being wrong are much more serious.

I think it doesn't just affect the < 9.4 path, but also makes us implement
things differently for >= 9.4. And we lose some accuracy due to that.


> > I doubt it is correct to enter this path with next_xmin ==
> > FrozenTransactionId. This is following a ctid chain that we normally wouldn't
> > follow, because it doesn't satisfy the t_self->xmax == t_ctid->xmin condition.
> 
> We should never see FrozenTransactionId in an xmax field (nor should
> it be returned by HeapTupleHeaderGetUpdateXid() under any
> circumstances).

The field we check for FrozenTransactionId in the code I was quoting is the
xmin of the follower tuple. We follow the chain if either
cur->xmax == next->xmin or if next->xmin == FrozenTransactionId

What I'm doubting is the FrozenTransactionId path.


> Anyway, it follows that we cannot expect "next_xmin ==
> FrozenTransactionId", because that would mean that we'd called
> HeapTupleHeaderGetUpdateXid() which returned FrozenTransactionId -- an
> impossibility. (Maybe we should be checking that it really is an
> impossibility by checking the HeapTupleHeaderGetUpdateXid() return
> value, but that should be enough.)

next_xmin is acquired via HeapTupleHeaderGetXmin(next_htup), not
HeapTupleHeaderGetUpdateXid(curr_htup).


> > I don't immediately see what prevents the frozen tuple being from an entirely
> > different HOT chain than the two tuples pointing to it.
> 
> In my view it simply isn't possible for a valid HOT chain to be in
> this state in the first place. So by definition it wouldn't be a HOT
> chain.

We haven't done any visibility checking at this point and my whole point is
that there's no guarantee that the pointed-to tuple actually belongs to the
same hot chain, given that we follow as soon as "xmin == FrozenXid". So the
pointing tuple might be an orphaned tuple.


> That would be a form of corruption, which is something that
> would probably be detected by noticing orphaned heap-only tuples
> (heap-only tuples not reachable from some root item on the same page,
> or some other intermediary heap-only tuple reachable from a root
> item).

You're saying that there's no way that there's a tuple pointing to another
tuple on the same page, with the pointed-to tuple belonging to a different
HOT chain?

I'm fairly certain that that at least used to be possible, and likely is still
possible. Isn't that pretty much what you'd expect to happen if there's
concurrent aborts leading to abandoned hot chains?



> > > Can't we have an update chain that is e.g.
> > xmin 10, xmax 5 -> xmin 5, xmax invalid
> >
> > > and a vacuum cutoff of 7? That'd prevent the first tuple from being removed,
> > but would allow 5 to be frozen.
> 
> I don't see how that can be possible. That is contradictory, and
> cannot possibly work, since it supposes a situation where every
> possible MVCC snapshot sees the update that generated the
> second/successor tuple as committed, while at the same time also
> somehow needing the original tuple to stay in place. Surely both
> things can never be true at the same time.

The xmin horizon is very coarse grained. Just because it is 7 doesn't mean
that xid 10 is still running. All it means that one backend or slot has an
xmin or xid of 7.

s1: acquire xid 5
s2: acquire xid 7
s3: acquire xid 10

s3: insert
s3: commit
s1: update
s1: commit

s2: get a new snapshot, xmin 7 (or just hold no snapshot)

At this point the xmin horizon is 7. The first tuple's xmin can't be
frozen. The second tuple's xmin can be.

Note that indeed no backend could actually see the first tuple - xid 7's
snapshot won't have it marked as running, therefore it will be invisible. But
we will think the first tuple is recently dead rather than buried deeply,
because the xmin horizon is only 7.


> I believe you're right that an update chain that looks like this one
> is possible. However, I don't think it's possible for
> OldestXmin/FreezeLimit to take on a value like that (i.e. a value that
> "skewers" the update chain like this, the value 7 from your example).
> We ought to be able to rely on an OldestXmin value that can never let
> such a situation emerge. Right?

I don't see anything that'd guarantee that currently, nor do I immediately see a
possible way to get there.

What do you think prevents such an OldestXmin?

I might be missing something myself...

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Wed, Nov 9, 2022 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:
> To me this is extending the problem into more areas rather than reducing
> it. I'd have *zero* confidence in any warnings that amcheck issued that
> involved <9.4 special cases.

Maybe you would at first. But then we get to learn what mistake we
made. And then we get to fix the bug, and get to know better next time
around. Next time (or maybe the time after that) you really will have
confidence in amcheck, because it'll have been battle tested at that
point.

For something like this that seems like the best way to go. Either we
support an on-disk format that includes legacy representations, or we
don't.

> I think it doesn't just affect the < 9.4 path, but also makes us implement
> things differently for >= 9.4. And we lose some accuracy due to that.

I don't follow. How so?

> The field we check for FrozenTransactionId in the code I was quoting is the
> xmin of the follower tuple. We follow the chain if either
> cur->xmax == next->xmin or if next->xmin == FrozenTransactionId
>
> What I'm doubting is the FrozenTransactionId path.

AFAICT we shouldn't be treating it as part of the same HOT chain. Only
the first heap-only tuple in a valid HOT chain should have an xmin
that is FrozenTransactionId (and only with an LP_REDIRECT root item, I
think). Otherwise the "prev_xmax==xmin" HOT chain traversal logic used
in places like heap_hot_search_buffer() simply won't work.
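
(For reference, the chain-following rule in heap_hot_search_buffer() is
roughly the following -- a paraphrase, not the verbatim source:)

    /*
     * Each chain member's xmin must match the xmax of the tuple we
     * arrived from; a frozen xmin can never match, so it breaks the chain.
     */
    if (TransactionIdIsValid(prev_xmax) &&
        !TransactionIdEquals(prev_xmax,
                             HeapTupleHeaderGetXmin(heapTuple->t_data)))
        break;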

> > In my view it simply isn't possible for a valid HOT chain to be in
> > this state in the first place. So by definition it wouldn't be a HOT
> > chain.
>
> We haven't done any visibility checking at this point and my whole point is
> that there's no guarantee that the pointed-to tuple actually belongs to the
> same hot chain, given that we follow as soon as "xmin == FrozenXid". So the
> pointing tuple might be an orphaned tuple.

But an orphaned heap-only tuple shouldn't ever have
xmin==FrozenTransactionId to begin with. The only valid source of
orphaned heap-only tuples is transaction aborts. Aborted XID xmin
fields are never frozen.

> > That would be a form of corruption, which is something that
> > would probably be detected by noticing orphaned heap-only tuples
> > (heap-only tuples not reachable from some root item on the same page,
> > or some other intermediary heap-only tuple reachable from a root
> > item).
>
> You're saying that there's no way that there's a tuple pointing to another
> tuple on the same page, with the pointed-to tuple belonging to a different
> HOT chain?

Define "belongs to a different HOT chain".

You can get orphaned heap-only tuples, obviously. But only due to
transaction abort. Any page with an orphaned heap-only tuple that is
not consistent with it being from an earlier abort is a corrupt heap
page.

> > > Can't we have an update chain that is e.g.
> > > xmin 10, xmax 5 -> xmin 5, xmax invalid
> > >
> > > and a vacuum cutoff of 7? That'd prevent the first tuple from being removed,
> > > but would allow 5 to be frozen.
> >
> > I don't see how that can be possible. That is contradictory, and
> > cannot possibly work, since it supposes a situation where every
> > possible MVCC snapshot sees the update that generated the
> > second/successor tuple as committed, while at the same time also
> > somehow needing the original tuple to stay in place. Surely both
> > things can never be true at the same time.
>
> The xmin horizon is very coarse grained. Just because it is 7 doesn't mean
> that xid 10 is still running. All it means that one backend or slot has an
> xmin or xid of 7.

Of course that's true. But I wasn't talking about the general case --
I was talking about your "xmin 10, xmax 5 -> xmin 5, xmax invalid"
update chain case specifically, with its "skewered" OldestXmin of 7.

> s1: acquire xid 5
> s2: acquire xid 7
> s3: acquire xid 10
>
> s3: insert
> s3: commit
> s1: update
> s1: commit
>
> s2: get a new snapshot, xmin 7 (or just hold no snapshot)
>
> At this point the xmin horizon is 7. The first tuple's xmin can't be
> frozen. The second tuple's xmin can be.

Basically what I'm saying about OldestXmin is that it ought to "work
transitively", from the updater to the inserter that inserted the
now-updated tuple. That is, the OldestXmin should either count both
XIDs that appear in the update chain, or neither XID.

> > I believe you're right that an update chain that looks like this one
> > is possible. However, I don't think it's possible for
> > OldestXmin/FreezeLimit to take on a value like that (i.e. a value that
> > "skewers" the update chain like this, the value 7 from your example).
> > We ought to be able to rely on an OldestXmin value that can never let
> > such a situation emerge. Right?
>
> I don't see anything that'd guarantee that currently, nor do I immediately see a
> possible way to get there.
>
> What do you think prevents such an OldestXmin?

ComputeXidHorizons() computes VACUUM's OldestXmin (actually it
computes h->data_oldest_nonremovable values) by scanning the proc
array. And counts PGPROC.xmin from each running xact. So ultimately
the inserter and updater are tied together by that. It's either an
OldestXmin that includes both, or one that includes neither.

Here are some facts that I think we both agree on already:

1. It is definitely possible to have an update chain like your "xmin
10, xmax 5 -> xmin 5, xmax invalid" example.

2. It is definitely not possible to "freeze xmax" by setting its value
to FrozenTransactionId or something similar -- there is simply no code
path that can do that, and never has been. (The term "freeze xmax" is
a bit ambiguous, though it usually means set xmax to
InvalidTransactionId.)

3. There is no specific reason to believe that there is a live bug here.

Putting all 3 together: doesn't it seem quite likely that the way that
we compute OldestXmin is the factor that prevents "skewering" of an
update chain? What else could possibly be preventing corruption here?
(Theoretically it might never have been discovered, but that seems
pretty hard to believe.)

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-09 17:32:46 -0800, Peter Geoghegan wrote:
> > The xmin horizon is very coarse grained. Just because it is 7 doesn't mean
> > that xid 10 is still running. All it means is that one backend or slot has an
> > xmin or xid of 7.
> 
> Of course that's true. But I wasn't talking about the general case --
> I was talking about your "xmin 10, xmax 5 -> xmin 5, xmax invalid"
> update chain case specifically, with its "skewered" OldestXmin of 7.

The sequence below produces such an OldestXmin:

> > s1: acquire xid 5
> > s2: acquire xid 7
> > s3: acquire xid 10
> >
> > s3: insert
> > s3: commit
> > s1: update
> > s1: commit
> >
> > s2: get a new snapshot, xmin 7 (or just hold no snapshot)
> >
> > At this point the xmin horizon is 7. The first tuple's xmin can't be
> > frozen. The second tuple's xmin can be.
> 
> Basically what I'm saying about OldestXmin is that it ought to "work
> transitively", from the updater to the inserter that inserted the
> now-updated tuple. That is, the OldestXmin should either count both
> XIDs that appear in the update chain, or neither XID.

It doesn't work that way. The above sequence shows one case where it doesn't.


> > > I believe you're right that an update chain that looks like this one
> > > is possible. However, I don't think it's possible for
> > > OldestXmin/FreezeLimit to take on a value like that (i.e. a value that
> > > "skewers" the update chain like this, the value 7 from your example).
> > > We ought to be able to rely on an OldestXmin value that can never let
> > > such a situation emerge. Right?
> >
> > I don't see anything that'd guarantee that currently, nor do I immediately see a
> > possible way to get there.
> >
> > What do you think prevents such an OldestXmin?
> 
> ComputeXidHorizons() computes VACUUM's OldestXmin (actually it
> computes h->data_oldest_nonremovable values) by scanning the proc
> array. And counts PGPROC.xmin from each running xact. So ultimately
> the inserter and updater are tied together by that. It's either an
> OldestXmin that includes both, or one that includes neither.

> Here are some facts that I think we both agree on already:
> 
> 1. It is definitely possible to have an update chain like your "xmin
> 10, xmax 5 -> xmin 5, xmax invalid" example.
> 
> 2. It is definitely not possible to "freeze xmax" by setting its value
> to FrozenTransactionId or something similar -- there is simply no code
> path that can do that, and never has been. (The term "freeze xmax" is
> a bit ambiguous, though it usually means set xmax to
> InvalidTransactionId.)
> 
> 3. There is no specific reason to believe that there is a live bug here.

I don't think there's a live bug here. I think the patch isn't dealing
correctly with that issue though.


> Putting all 3 together: doesn't it seem quite likely that the way that
> we compute OldestXmin is the factor that prevents "skewering" of an
> update chain? What else could possibly be preventing corruption here?
> (Theoretically it might never have been discovered, but that seems
> pretty hard to believe.)

I don't see how that follows. The existing code is just ok with that. In fact
we have explicit code trying to exploit this:

        /*
         * If the DEAD tuple is at the end of the chain, the entire chain is
         * dead and the root line pointer can be marked dead.  Otherwise just
         * redirect the root to the correct chain member.
         */
        if (i >= nchain)
            heap_prune_record_dead(prstate, rootoffnum);
        else
            heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);


Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

And thinking about it, it'd be quite bad if the horizon worked that way. You can easily construct a workload where every single xid would "skewer" some chain, never allowing the horizon to be raised.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Wed, Nov 9, 2022 at 5:46 PM Andres Freund <andres@anarazel.de> wrote:
> > Putting all 3 together: doesn't it seem quite likely that the way that
> > we compute OldestXmin is the factor that prevents "skewering" of an
> > update chain? What else could possibly be preventing corruption here?
> > (Theoretically it might never have been discovered, but that seems
> > pretty hard to believe.)
>
> I don't see how that follows. The existing code is just ok with that.

My remarks about "3 facts we agree on" were not intended to be a
watertight argument. More like: what else could it possibly be that
prevents problems in practice, if not *something* to do with how we
compute OldestXmin?

Leaving aside the specifics of how OldestXmin is computed for a
moment: what alternative explanation is even remotely plausible? There
just aren't that many moving parts involved here. The idea that we can
ever freeze the xmin of a successor tuple/version from an update chain
without also pruning away earlier versions of the same chain is wildly
implausible. It sounds totally contradictory.

> In fact
> we have explicit code trying to exploit this:
>
>                 /*
>                  * If the DEAD tuple is at the end of the chain, the entire chain is
>                  * dead and the root line pointer can be marked dead.  Otherwise just
>                  * redirect the root to the correct chain member.
>                  */
>                 if (i >= nchain)
>                         heap_prune_record_dead(prstate, rootoffnum);
>                 else
>                         heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);

I don't see why this code is relevant.

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Wed, Nov 9, 2022 at 6:10 PM Andres Freund <andres@anarazel.de> wrote:
> And thinking about it, it'd be quite bad if the horizon worked that way. You
> can easily construct a workload where every single xid would "skewer" some
> chain, never allowing the horizon to be raised.

Your whole scenario is one involving an insert of a tuple by XID 10,
which is then updated by XID 5 -- a lower XID. Obviously that's
possible, but it's relatively rare. I have to imagine that the vast
majority of updates affect tuples inserted by an XID before the XID of
the updater.

My use of the term "skewer" was limited to updates that look like
that. So I don't know what you mean about never allowing the horizon
to be raised.

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-09 18:35:12 -0800, Peter Geoghegan wrote:
> On Wed, Nov 9, 2022 at 6:10 PM Andres Freund <andres@anarazel.de> wrote:
> > And thinking about it, it'd be quite bad if the horizon worked that way. You
> > can easily construct a workload where every single xid would "skewer" some
> > chain, never allowing the horizon to be raised.
> 
> Your whole scenario is one involving an insert of a tuple by XID 10,
> which is then updated by XID 5 -- a lower XID. Obviously that's
> possible, but it's relatively rare. I have to imagine that the vast
> majority of updates affect tuples inserted by an XID before the XID of
> the updater.

> My use of the term "skewer" was limited to updates that look like
> that. So I don't know what you mean about never allowing the horizon
> to be raised.

You don't need it to happen all the time; it's enough when it happens
occasionally, since that'd "block" the whole range of xids in between. So
you'd just need occasional transactions to prevent the horizon from
increasing.


Anyway, I played a bit around with this. It's hard to hit, not because we
somehow won't choose such a horizon, but because we'll commonly prune the
earlier tuple version away due to xmax being old enough. It *is* possible to
hit, if the horizon increases between the two tuple version checks (e.g. if
there's another tuple in between that we check the visibility of).

I think there's another way it can happen in older clusters, but I don't want
to spend the time to verify it.

Either way, we can't error out in this situation - there's nothing invalid
about it.

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Mon, Nov 14, 2022 at 9:38 AM Andres Freund <andres@anarazel.de> wrote:
> Anyway, I played a bit around with this. It's hard to hit, not because we
> somehow won't choose such a horizon, but because we'll commonly prune the
> earlier tuple version away due to xmax being old enough.

That must be a bug, then. Since, as I said, I can't see how it could
possibly be okay to freeze the xmin of a tuple in a HOT chain without
also making sure that it has no earlier versions left behind. If there
are earlier versions that we have to go through to get to the
frozen-xmin tuple (not just an LP_REDIRECT), we're going to break the
HOT chain traversal logic in code like heap_hot_search_buffer in a
rather obvious way.

HOT chain traversal code will interpret the frozen xmin from the
tuple as FrozenTransactionId (not as its raw xmin). So traversal is
just broken in this scenario.

> It *is* possible to
> hit, if the horizon increases between the two tuple version checks (e.g. if
> there's another tuple in between that we check the visibility of).

I suppose that that's the detail that "protects" us, then -- that
would explain the apparent lack of problems in the field. Your
sequence requires 3 sessions, not just 2.

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Wed, Nov 9, 2022 at 5:08 PM Andres Freund <andres@anarazel.de> wrote:
> I don't really understand this logic - why can't we populate the predecessor
> array, if we can construct a successor entry?

This whole thing was my idea, so let me try to explain. I think the
naming and comments need work, but I believe the fundamental idea may
be sound.

successor[x] = y means that when we looked at line pointer x, we saw
that it was either a redirect to line pointer y, or else it had
storage and the associated tuple's CTID pointed to line pointer y. At
this point, we do not have any idea whether y is at all sane, nor we
do we know anything about which of x and y is larger. Furthermore, it
is possible that successor[x] = successor[x'] since the page might be
corrupted and we haven't checked otherwise.

predecessor[y] = x means that successor[x] = y but in addition we've
checked that y is sane, and that x.xmax=y.xmin. If there are multiple
tuples for which these conditions hold, we've issued complaints about
all but one and entered the last into the predecessor array.

An earlier version of the algorithm had only a predecessor[] array but
the code to try to populate in a single pass was complex and looked
ugly and error-prone. To set a predecessor entry in one step, we had
to sanity-check each of x and y but only if that hadn't yet been done,
which was quite awkward. For example, imagine line pointers 1 and 2
both point to 3, and line pointer 3 points backward to line pointer 1
(because of corruption, since it shouldn't ever be circular). We can't
reason about the relationship between 1 and 3 without first making
sure that each one is sane in isolation. But if we do that when we're
at line pointer 1, then when we get to 2, we need to check 2 but don't
need to recheck 3, and when we get to 3 we need to recheck neither 3
nor 1. This algorithm lets us run through and do all the basic sanity
checks first, while populating the successor array, and then check
relationships in later stages.
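
In outline, the first two stages look something like this (a sketch only;
item_passes_basic_checks(), raw_link_of() and xmax_matches_xmin() are
hypothetical stand-ins for the patch's inline checks):

    /* Stage 1: per-item sanity checks; record raw links only. */
    for (OffsetNumber x = FirstOffsetNumber; x <= maxoff;
         x = OffsetNumberNext(x))
    {
        if (!item_passes_basic_checks(page, x))
            continue;                   /* lp_valid[x] stays false */
        lp_valid[x] = true;
        successor[x] = raw_link_of(page, x);    /* redirect or t_ctid */
    }

    /* Stage 2: reason about pairs only once both ends are known-sane. */
    for (OffsetNumber x = FirstOffsetNumber; x <= maxoff;
         x = OffsetNumberNext(x))
    {
        OffsetNumber y = successor[x];

        if (y == InvalidOffsetNumber || !lp_valid[x] || !lp_valid[y])
            continue;
        if (!xmax_matches_xmin(page, x, y))
            continue;
        if (predecessor[y] != InvalidOffsetNumber)
            /* two chains converge on y: report corruption */ ;
        else
            predecessor[y] = x;
    }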

Part of the motivation here is also driven by trying to figure out how
to word the complaints. We have a dedicated field in the amcheck report that
can hold one tuple offset or the other, but if we're checking the
relationships between tuples, what do we put there? I feel it will be
easiest to understand if we put the offset of the older tuple in that
field and then phrase the complaint as the patch does, e.g.:

tuple with uncommitted xmin %u was updated to produce a tuple at
offset %u with differing xmin %u

We could flip that around and put the newer tuple offset in the field
and then phrase the complaint the other way around, but it seems a bit
awkward, e.g.:

tuple with uncommitted xmin %u at offset %u was updated to produce this
tuple with differing xmin %u

I think if we did do it that way around (and figured out how to phrase
the messages) we might not need both arrays any more (though I'm not
positive about that). It seems hard to avoid needing at least one,
else you can't explicitly notice two converging HOT chains, which
seems like a case we probably ought to notice. But the "to produce
this tuple" phrasing is just confusing, I think, and removing "this"
doesn't help. You need to somehow get people to understand that the
offset they probably saw in another field is the second tuple, not the
first one. Maybe:

xmin %u does not match xmax %u of prior tuple at offset %u

Hmm.

Anyway, whether it was the right idea or not, the desire to have the
earlier tuple be the focus of the error messages was part of the
motivation here.

> I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
> of getting that right seems low and I don't see us gaining much by even trying.

I agree with Peter. We have to try to get that case right. If we can
eventually eliminate it as a valid case by some mechanism, hooray. But
in the meantime we have to deal with it as best we can.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-14 09:50:48 -0800, Peter Geoghegan wrote:
> On Mon, Nov 14, 2022 at 9:38 AM Andres Freund <andres@anarazel.de> wrote:
> > Anyway, I played a bit around with this. It's hard to hit, not because we
> > somehow won't choose such a horizon, but because we'll commonly prune the
> > earlier tuple version away due to xmax being old enough.
>
> That must be a bug, then. Since, as I said, I can't see how it could
> possibly be okay to freeze the xmin of a tuple in a HOT chain without
> also making sure that it has no earlier versions left behind.

Hard to imagine us having bugs in this code. Ahem.

I really wish I knew of a reasonably simple way to utilize coverage guided
fuzzing on heap pruning / vacuuming.


I wonder if we ought to add an error check to heap_prepare_freeze_tuple()
against this scenario. We're working towards being more aggressive around
freezing, which will make it more likely to hit corner cases around this.



> If there are earlier versions that we have to go through to get to the
> frozen-xmin tuple (not just an LP_REDIRECT), we're going to break the HOT
> chain traversal logic in code like heap_hot_search_buffer in a rather
> obvious way.
>
> HOT chain traversal code will interpret the frozen xmin from the
> tuple as FrozenTransactionId (not as its raw xmin). So traversal is
> just broken in this scenario.
>

Which'd still be fine if the whole chain were already "fully dead". One way I
think this can happen is <= PG 13's HEAPTUPLE_DEAD handling in
lazy_scan_heap().

I now suspect that the seemingly-odd "We will advance past RECENTLY_DEAD
tuples just in case there's a DEAD one after them;" logic in
heap_prune_chain() might be required for correctness.  Which IIRC we'd been
talking about getting rid of elsewhere?

<tinkers>

At least as long as correctness requires not ending up in endless loops -
indeed. We end up with lazy_scan_prune() endlessly retrying. Without a chance
to interrupt. Shouldn't there at least be a CFI somewhere?  The attached
isolationtester spec has a commented out test for this.


I think the problem partially is that the proposed verify_heapam() code is too
"aggressive" considering things to be part of the same hot chain - which then
means we have to be very careful about erroring out.

The attached isolationtester test triggers:
"unfrozen tuple was updated to produce a tuple at offset %u which is frozen"
"updated version at offset 3 is also the updated version of tuple at offset %u"

Despite there afaict not being any corruption. Worth noting that this happens
regardless of hot/non-hot updates being used (uncomment s3ci to see).


> > It *is* possible to
> > hit, if the horizon increases between the two tuple version checks (e.g. if
> > there's another tuple inbetween that we check the visibility of).
>
> I suppose that that's the detail that "protects" us, then -- that
> would explain the apparent lack of problems in the field. Your
> sequence requires 3 sessions, not just 2.

One important protection right now is that vacuumlazy.c uses a more
pessimistic horizon than pruneheap.c. Even if visibility determinations within
pruning recompute the horizon, vacuumlazy.c won't freeze based on the advanced
horizon.  I don't quite know where we'd best put a comment with a warning
about this fact.

Greetings,

Andres Freund

Attachment

Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Mon, Nov 14, 2022 at 11:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Part of the motivation here is also driven by trying to figure out how
> to word the complaints. We have a dedicated field in the amcheck report that
> can hold one tuple offset or the other, but if we're checking the
> relationships between tuples, what do we put there? I feel it will be
> easiest to understand if we put the offset of the older tuple in that
> field and then phrase the complaint as the patch does, e.g.:

That makes a lot of sense to me, and reminds me of how things work in
verify_nbtree.c.

At a high level verify_nbtree.c works by doing a breadth-first
traversal of the tree. The search makes each distinct page the "target
page" exactly once. The target page is the clear focal point for
everything -- almost every complaint about corruption frames the
problem as a problem in the target page. We consistently describe
things in terms of their relationship with the target page, so under
this scheme everybody is...on the same page (ahem).

Being very deliberate about that probably had some small downsides.
Maybe it would have made a little more sense to word certain
particular corruption report messages in a way that placed blame on
"ancillary" pages like sibling/child pages (not the target page) as
problems in the ancillary page itself, not the target page. This still
seems like the right trade-off -- the control flow can be broken up
into understandable parts once you understand that the target page is
the thing that we use to describe every other page.

> > I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
> > of getting that right seems low and I don't see us gaining much by even trying.
>
> I agree with Peter. We have to try to get that case right. If we can
> eventually eliminate it as a valid case by some mechanism, hooray. But
> in the meantime we have to deal with it as best we can.

Practiced intellectual humility seems like the way to go here. On some
level I suspect that we'll have problems in exactly the places that we
don't look for them.

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-14 14:27:54 -0500, Robert Haas wrote:
> On Wed, Nov 9, 2022 at 5:08 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't really understand this logic - why can't we populate the predecessor
> > array, if we can construct a successor entry?
> 
> This whole thing was my idea, so let me try to explain. I think the
> naming and comments need work, but I believe the fundamental idea may
> be sound.
> 
> successor[x] = y means that when we looked at line pointer x, we saw
> that it was either a redirect to line pointer y, or else it had
> storage and the associated tuple's CTID pointed to line pointer y.

> At this point, we do not have any idea whether y is at all sane, nor do
> we know anything about which of x and y is larger.

What do you mean by "larger" here?


> Furthermore, it is
> possible that successor[x] = successor[x'] since the page might be corrupted
> and we haven't checked otherwise.
> 
> predecessor[y] = x means that successor[x] = y but in addition we've
> checked that y is sane, and that x.xmax=y.xmin. If there are multiple
> tuples for which these conditions hold, we've issued complaints about
> all but one and entered the last into the predecessor array.

As shown by the isolationtester test I just posted, this doesn't quite work
right now. Probably fixable.

I don't think we can follow non-HOT ctid chains if they're older than the xmin
horizon, including all cases of xmin being frozen. There's just nothing
guaranteeing that the tuples are actually "related".


It seems like we should do a bit more validation within a chain of
tuples. E.g. that no live tuple can follow an !DidCommit xmin?



> > I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
> > of getting that right seems low and I don't see us gaining much by even trying.
> 
> I agree with Peter. We have to try to get that case right. If we can
> eventually eliminate it as a valid case by some mechanism, hooray. But
> in the meantime we have to deal with it as best we can.

I now think that the 9.4 specific reasoning is bogus in the first place. The
patch says:

             * Add a line pointer offset to the predecessor array if xmax is
             * matching with xmin of next tuple (reaching via its t_ctid).
             * Prior to PostgreSQL 9.4, we actually changed the xmin to
             * FrozenTransactionId so we must add offset to predecessor
             * array(irrespective of xmax-xmin matching) if updated tuple xmin
             * is frozen, so that we can later do validation related to frozen
             * xmin. Raise corruption if we have two tuples having the same
             * predecessor.

but it's simply not correct to iterate through xmin=FrozenTransactionId - as
shown in the isolationtester test. And that's unrelated to 9.4, because we
couldn't rely on the raw xmin value either, because even if they match, they
could be from different epochs.

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Mon, Nov 14, 2022 at 12:58 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if we ought to add an error check to heap_prepare_freeze_tuple()
> against this scenario. We're working towards being more aggressive around
> freezing, which will make it more likely to hit corner cases around this.

In theory my work on freezing doesn't change the basic rules about how
freezing works, and doesn't know anything about HOT, so it shouldn't
introduce any new risk. Even still, I agree that this seems like
something to do in the scope of the same work, just in case. Plus it's
just important.

It would be possible to have exhaustive heap_prepare_freeze_tuple
checks in assert-only builds -- we can exhaustively check the final
array of prepared freeze plans that we collected for a given heap
page, and check it against the page exhaustively right before freezing
is executed. That's not perfect, but it would be a big improvement.
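
Something with roughly this shape, say -- purely a sketch, borrowing the
frozen[]/nfrozen freeze-plan array from lazy_scan_prune(), and with the
actual invariants to assert left open:

#ifdef USE_ASSERT_CHECKING
    /* sketch: re-verify each prepared freeze plan against the page */
    for (int i = 0; i < nfrozen; i++)
    {
        ItemId      lp = PageGetItemId(page, frozen[i].offset);

        Assert(ItemIdIsNormal(lp));
        /* ... plus whatever HOT-chain invariants we settle on ... */
    }
#endif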

Right now I am not entirely sure what I would need to check in such a
mechanism. I am legitimately unsure of what the rules are in light of
this new information.

> Which'd still be fine if the whole chain were already "fully dead". One way I
> think this can happen is <= PG 13's HEAPTUPLE_DEAD handling in
> lazy_scan_heap().

You mean the tupgone thing? Perhaps it would have avoided this
particular problem, or one like it. But it had so many other problems
that I don't see why it matters now.

> At least as long as correctness requires not ending up in endless loops -
> indeed. We end up with lazy_scan_prune() endlessly retrying. Without a chance
> to interrupt. Shouldn't there at least be a CFI somewhere?

Probably, but that wouldn't change the fact that it's a bug when this
happens. Obviously it's more important to avoid such a bug than it is
to ameliorate it.
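
(For the record, the minimal shape of the amelioration is just a
CHECK_FOR_INTERRUPTS() at the top of the retry loop -- illustrative only,
with a hypothetical helper standing in for the real prune step:)

    for (;;)
    {
        CHECK_FOR_INTERRUPTS();

        if (prune_page_once())      /* hypothetical stand-in */
            break;
    }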

> I think the problem partially is that the proposed verify_heapam() code is too
> "aggressive" considering things to be part of the same hot chain - which then
> means we have to be very careful about erroring out.
>
> The attached isolationtester test triggers:
> "unfrozen tuple was updated to produce a tuple at offset %u which is frozen"
> "updated version at offset 3 is also the updated version of tuple at offset %u"
>
> Despite there afaict not being any corruption. Worth noting that this happens
> regardless of hot/non-hot updates being used (uncomment s3ci to see).

Why don't you think that there is corruption?

The terminology here is tricky. It's possible that the amcheck patch
makes a very good point here, even without necessarily complaining
about a state that leads to obviously wrong behavior. It's also
possible that there really is wrong behavior, at least in my mind -- I
don't know what your remarks about no corruption are really based on.

I feel like I'm repeating myself more than I should, but: why isn't it
as simple as "HOT chain traversal logic is broken by frozen xmin in
the obvious way, therefore all bets are off"? Maybe you're right about
the proposed new functionality getting things wrong with your
adversarial isolation test, but I seem to have missed the underlying
argument. Are you just talking about regular update chains here, not
HOT chains? Something else?

> One important protection right now is that vacuumlazy.c uses a more
> pessimistic horizon than pruneheap.c. Even if visibility determinations within
> pruning recompute the horizon, vacuumlazy.c won't freeze based on the advanced
horizon.  I don't quite know where we'd best put a comment with a warning
> about this fact.

We already have comments discussing the relationship between
OldestXmin and vistest (as well as rel_pages) in heap_vacuum_rel().
That seems like the obvious place to put something like this, at least
to me.

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-14 14:13:10 -0800, Peter Geoghegan wrote:
> > I think the problem partially is that the proposed verify_heapam() code is too
> > "aggressive" considering things to be part of the same hot chain - which then
> > means we have to be very careful about erroring out.
> >
> > The attached isolationtester test triggers:
> > "unfrozen tuple was updated to produce a tuple at offset %u which is frozen"
> > "updated version at offset 3 is also the updated version of tuple at offset %u"
> >
> > Despite there afaict not being any corruption. Worth noting that this happens
> > regardless of hot/non-hot updates being used (uncomment s3ci to see).
> 
> Why don't you think that there is corruption?

I looked at the state after the test and the complaint is bogus. It's caused
by the patch ignoring the cur->xmax == next->xmin condition if next->xmin is
FrozenTransactionId. The isolationtester test creates a situation where that
leads to verify_heapam() considering tuples to be part of the same chain even
though they aren't.


> Because I feel like I'm repeating myself more than I should, but: why isn't
> it as simple as "HOT chain traversal logic is broken by frozen xmin in the
> obvious way, therefore all bets are off"?

Because that's irrelevant for the testcase and a good number of my concerns.


> Maybe you're right about the proposed new functionality getting things wrong
> with your adversarial isolation test, but I seem to have missed the
> underlying argument. Are you just talking about regular update chains here,
> not HOT chains? Something else?

As I noted, it happens regardless of HOT being used or not. The tuples aren't
part of the same chain, but the patch treats them as if they were.  The reason
the patch considers them to be part of the same chain is precisely the
FrozenTransactionId condition I was worried about. Just because t_ctid points
to a tuple on the same page and the next tuple has xmin ==
FrozenTransactionId, doesn't mean they're part of the same chain. Once you
encounter a tuple with a frozen xmin you simply cannot assume it's part of the
chain you've been following.

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Mon, Nov 14, 2022 at 2:33 PM Andres Freund <andres@anarazel.de> wrote:
> > Why don't you think that there is corruption?
>
> I looked at the state after the test and the complaint is bogus. It's caused
> by the patch ignoring the cur->xmax == next->xmin condition if next->xmin is
> FrozenTransactionId. The isolationtester test creates a situation where that
> leads to verify_heapam() considering tuples to be part of the same chain even
> though they aren't.

Having looked at your isolation test in more detail, it seems like you
were complaining about a fairly specific and uncontroversial
shortcoming in the patch itself: it complains about a newly inserted
tuple that gets frozen. It thinks that the inserted tuple is part of
the same HOT chain (or at least the same update chain) as other tuples
on the same heap page, when in fact it's just some wholly unrelated
tuple/logical row. It seems as if the new amcheck code doesn't get all
the details of validating HOT chains right, and so jumps the gun here,
reporting corruption based on a faulty assumption that the frozen-xmin
tuple is in any way related to the chain.

I was confused about whether we were talking about this patch, bugs in
HEAD, or both.

> > Maybe you're right about the proposed new functionality getting things wrong
> > with your adversarial isolation test, but I seem to have missed the
> > underlying argument. Are you just talking about regular update chains here,
> > not HOT chains? Something else?
>
> As I noted, it happens regardless of HOT being used or not. The tuples aren't
> part of the same chain, but the patch treats them as if they were.  The reason
> the patch considers them to be part of the same chain is precisely the
> FrozenTransactionId condition I was worried about. Just because t_ctid points
> to a tuple on the same page and the next tuple has xmin ==
> FrozenTransactionId, doesn't mean they're part of the same chain. Once you
> encounter a tuple with a frozen xmin you simply cannot assume it's part of the
> chain you've been following.

Got it.

That seems relatively straightforward and uncontroversial to me,
because it's just how code like heap_hot_search_buffer (HOT chain
traversal code) works already. The patch got some of those details
wrong, and should be revised.

What does this have to tell us, if anything, about the implications
for code on HEAD? I don't see any connection between this problem and
the possibility of a live bug on HEAD involving freezing later tuple
versions in a HOT chain, leaving earlier non-frozen versions behind to
break HOT chain traversal code. Should I have noticed such a
connection?

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-14 14:42:16 -0800, Peter Geoghegan wrote:
> What does this have to tell us, if anything, about the implications
> for code on HEAD?

Nothing really, as far as the test I sent goes (*) - I wanted to advance the
discussion about the patch being wrong as-is in a concrete way.

This logic was one of my main complaints in
https://postgr.es/m/20221109220803.t25sosmfvkeglhy4%40awork3.anarazel.de
and you went in a very different direction in your reply. Hence a test
showcasing the issue.

Note that neither of my complaints around FrozenTransactionId in that email
actually requires that HOT is involved. The code in the patch doesn't
differentiate between hot and not-hot until later.


> I don't see any connection between this problem and the possibility of a
> live bug on HEAD involving freezing later tuple versions in a HOT chain,
> leaving earlier non-frozen versions behind to break HOT chain traversal
> code. Should I have noticed such a connection?

No.

Greetings,

Andres Freund

(*) the commented-out test perhaps is an argument for expanding the comment
around "We will advance past RECENTLY_DEAD tuples just in case there's a DEAD
one after them;" in heap_prune_chain()



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Mon, Nov 14, 2022 at 2:58 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-14 14:42:16 -0800, Peter Geoghegan wrote:
> > What does this have to tell us, if anything, about the implications
> > for code on HEAD?
>
> Nothing really, as far as the test I sent goes (*) - I wanted to advance the
> discussion about the patch being wrong as-is in a concrete way.

Got it.

> This logic was one of my main complaints in
> https://postgr.es/m/20221109220803.t25sosmfvkeglhy4%40awork3.anarazel.de
> and you went in a very different direction in your reply. Hence a test
> showcasing the issue.

I guess I was also confused by the fact that you called it
"skewer.diff", which is a terminology I invented to describe the scary
HOT chain freezing bug. You probably started out writing an isolation
test to do something like that, but then repurposed it to show a bug
in the patch. Anyway, never mind, I understand you now.

> Note that neither of my complaints around FrozenTransactionId in that email
> actually requires that HOT is involved. The code in the patch doesn't
> differentiate between hot and not-hot until later.

I understand. I mentioned HOT only because it's more obviously not
okay with HOT -- you can point to the precise code that is broken
quite easily (index scans break when HOT chain traversals give wrong
answers in the problem scenario with freezing HOT chains in the wrong
place).

I'd really like to know if the scary HOT chain freezing scenario is
possible, for the very obvious reason. Have you tried to write a test
case for that?

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Mon, Nov 14, 2022 at 5:02 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-14 14:27:54 -0500, Robert Haas wrote:
> > On Wed, Nov 9, 2022 at 5:08 PM Andres Freund <andres@anarazel.de> wrote:
> > > I don't really understand this logic - why can't we populate the predecessor
> > > array, if we can construct a successor entry?
> >
> > This whole thing was my idea, so let me try to explain. I think the
> > naming and comments need work, but I believe the fundamental idea may
> > be sound.
> >
> > successor[x] = y means that when we looked at line pointer x, we saw
> > that it was either a redirect to line pointer y, or else it had
> > storage and the associated tuple's CTID pointed to line pointer y.
>
> > At this point, we do not have any idea whether y is at all sane, nor do
> > we know anything about which of x and y is larger.
>
> What do you mean by "larger" here?

Numerically bigger. As in, a redirect line pointer or CTID will most
commonly point to a tuple that appears later in the line pointer
array, because we assign offset numbers in ascending order. But we
also reuse line pointers, so it's possible for a redirect line pointer
or CTID to point backwards to a lower-numbered offset.
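
For concreteness, the population step is roughly this shape -- a minimal
sketch, not the patch's actual code, with page, blkno and maxoff assumed to
be in scope:

    OffsetNumber successor[MaxOffsetNumber + 1] = {0};

    for (OffsetNumber off = FirstOffsetNumber; off <= maxoff;
         off = OffsetNumberNext(off))
    {
        ItemId      lp = PageGetItemId(page, off);

        if (ItemIdIsRedirected(lp))
            successor[off] = ItemIdGetRedirect(lp);
        else if (ItemIdIsNormal(lp))
        {
            HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, lp);

            /* record the CTID target; it has not been sanity-checked yet */
            if (ItemPointerGetBlockNumber(&htup->t_ctid) == blkno)
                successor[off] = ItemPointerGetOffsetNumber(&htup->t_ctid);
        }
    }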

> > Furthermore, it is
> > possible that successor[x] = successor[x'] since the page might be corrupted
> > and we haven't checked otherwise.
> >
> > predecessor[y] = x means that successor[x] = y but in addition we've
> > checked that y is sane, and that x.xmax=y.xmin. If there are multiple
> > tuples for which these conditions hold, we've issued complaints about
> > all but one and entered the last into the predecessor array.
>
> As shown by the isolationtester test I just posted, this doesn't quite work
> right now. Probably fixable.
>
> I don't think we can follow non-HOT ctid chains if they're older than the xmin
> horizon, including all cases of xmin being frozen. There's just nothing
> guaranteeing that the tuples are actually "related".

Yeah, glad you caught that. I think it's clearly wrong to regard A and
B as related if B.xmin is frozen. It doesn't matter whether it's
"old-style" frozen where we actually put 2 into the xmin field, or
new-style frozen where we set hint bits. Because, even if it's
new-style frozen and the value actually stored in B.xmin is
numerically equal to A.xmax, it could be from a different epoch. If
it's from the same epoch, then something is corrupted, because when we
froze B we should have pruned A. But we have no way of knowing whether
that's the case, and shouldn't assume corruption.
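
To spell the epoch point out, with illustrative values only:

    /* the same raw xid in two different epochs is two different xacts */
    FullTransactionId a = FullTransactionIdFromEpochAndXid(4, 1000);
    FullTransactionId b = FullTransactionIdFromEpochAndXid(5, 1000);

    Assert(XidFromFullTransactionId(a) == XidFromFullTransactionId(b));
    Assert(!FullTransactionIdEquals(a, b));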

> It seems like we should do a bit more validation within a chain of
> tuples. E.g. that no live tuple can follow an !DidCommit xmin?

I think this check is already present in stronger form. If we see a
!DidCommit xmin, the xmin of the next tuple in the chain not only
can't be committed, but had better be the same. See "tuple with
uncommitted xmin %u was updated to produce a tuple at offset %u with
differing xmin %u".

> I now think that the 9.4 specific reasoning is bogus in the first place. The
> patch says:
>
>                          * Add a line pointer offset to the predecessor array if xmax is
>                          * matching with xmin of next tuple (reaching via its t_ctid).
>                          * Prior to PostgreSQL 9.4, we actually changed the xmin to
>                          * FrozenTransactionId so we must add offset to predecessor
>                          * array(irrespective of xmax-xmin matching) if updated tuple xmin
>                          * is frozen, so that we can later do validation related to frozen
>                          * xmin. Raise corruption if we have two tuples having the same
>                          * predecessor.
>
> but it's simply not correct to iterate through xmin=FrozenTransactionId - as
> shown in the isolationtester test. And that's unrelated to 9.4, because we
> couldn't rely on the raw xmin value either, because even if they match, they
> could be from different epochs.

I agree completely.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-15 11:36:21 -0500, Robert Haas wrote:
> On Mon, Nov 14, 2022 at 5:02 PM Andres Freund <andres@anarazel.de> wrote:
> > It seems like we should do a bit more validation within a chain of
> > tuples. E.g. that no live tuple can follow an !DidCommit xmin?
> 
> I think this check is already present in stronger form. If we see a
> !DidCommit xmin, the xmin of the next tuple in the chain not only can't be
> committed, but had better be the same.

As I think I mentioned before, I don't think the "better be the same" aspect
is correct, think subxacts. E.g.

off 0: xmin: top, xmax: child_1
off 1: xmin: child_1, xmax: invalid

If top hasn't committed yet, the current logic afaict will warn about this
situation, no? And I don't think we can generally determine the subxid parent
at this point, unfortunately (might have truncated subtrans).


Different aspect: Is it ok that we use TransactionIdDidCommit() without a
preceding IsInProgress() check?


I do think there's some potential for additional checks that don't run into
the above issue, e.g. checking that no in-progress xids follow an explicitly
aborted xact, that a committed xid can't follow an uncommitted xid etc.
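
Roughly like this, say -- a sketch only, with pred_status/curr_status assumed
to have been computed via get_xid_status() and the messages as placeholders:

    if (pred_status == XID_ABORTED && curr_status == XID_IN_PROGRESS)
        report_corruption(&ctx,
                          pstrdup("in-progress xmin follows aborted xmin"));
    if (pred_status != XID_COMMITTED && curr_status == XID_COMMITTED)
        report_corruption(&ctx,
                          pstrdup("committed xmin follows uncommitted xmin"));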

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Tue, Nov 15, 2022 at 2:50 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-15 11:36:21 -0500, Robert Haas wrote:
> > On Mon, Nov 14, 2022 at 5:02 PM Andres Freund <andres@anarazel.de> wrote:
> > > It seems like we should do a bit more validation within a chain of
> > > tuples. E.g. that no live tuple can follow an !DidCommit xmin?
> >
> > I think this check is already present in stronger form. If we see a
> > !DidCommit xmin, the xmin of the next tuple in the chain not only can't be
> > committed, but had better be the same.
>
> As I think I mentioned before, I don't think the "better be the same" aspect
> is correct, think subxacts. E.g.
>
> off 0: xmin: top, xmax: child_1
> off 1: xmin: child_1, xmax: invalid
>
> If top hasn't committed yet, the current logic afaict will warn about this
> situation, no? And I don't think we can generally determine the subxid parent
> at this point, unfortunately (might have truncated subtrans).

Woops, you're right.

> Different aspect: Is it ok that we use TransactionIdDidCommit() without a
> preceding IsInProgress() check?

Well, the code doesn't match the comments here, sadly. The comments
claim that we want to check that if the prior tuple's xmin was aborted
or in progress the current one is in the same state. If that's
actually what the code checked, we'd definitely need to check both
TransactionIdInProgress and TransactionIdCommit and be wary of the
possibility of the value changing concurrently. But the code doesn't
actually check the status of more than one XID, nor does it care about
the distinction between aborted and in progress, so I don't think that
the current code is buggy in that particular way, just in a bunch of
other ways.

> I do think there's some potential for additional checks that don't run into
> the above issue, e.g. checking that no in-progress xids follow an explicitly
> aborted xact, that a committed xid can't follow an uncommitted xid etc.

Yeah, maybe so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-14 15:07:05 -0800, Peter Geoghegan wrote:
> I'd really like to know if the scary HOT chain freezing scenario is
> possible, for the very obvious reason. Have you tried to write a test
> case for that?

I tried. Unfortunately, even if the bug exists, we currently don't have the
infrastructure to write isolationtester tests for it. There are just too many
points where we'd need to wait, and I don't know of ways to make
isolationtester wait at them.

I'm quite certain that it's possible to end up freezing an earlier row
version in a hot chain in < 14; I got there with careful gdb
orchestration. Of course it's possible I screwed something up, given I did it
once, interactively. Not sure if trying to fix it is worth the risk of
backpatching all the necessary changes to switch to the retry approach.

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Thu, Nov 10, 2022 at 3:38 AM Andres Freund <andres@anarazel.de> wrote:

> +             }
> +
> +             /*
> +              * Loop over offset and populate predecessor array from all entries
> +              * that are present in successor array.
> +              */
> +             ctx.attnum = -1;
> +             for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
> +                      ctx.offnum = OffsetNumberNext(ctx.offnum))
> +             {
> +                     ItemId          curr_lp;
> +                     ItemId          next_lp;
> +                     HeapTupleHeader curr_htup;
> +                     HeapTupleHeader next_htup;
> +                     TransactionId curr_xmax;
> +                     TransactionId next_xmin;
> +
> +                     OffsetNumber nextoffnum = successor[ctx.offnum];
> +
> +                     curr_lp = PageGetItemId(ctx.page, ctx.offnum);

Why do we get the item when nextoffnum is 0?

Yes, right, I will move this call to PageGetItemId, just after the next "if" condition in the patch.

> +                     if (nextoffnum == 0 || !lp_valid[ctx.offnum] || !lp_valid[nextoffnum])
> +                     {
> +                             /*
> +                              * This is either the last updated tuple in the chain or a
> +                              * corruption raised for this tuple.
> +                              */

"or a corruption raised" isn't quite right grammatically.

will change to "This is either the last updated tuple in the chain or corruption has been raised for this tuple" 

> +                             continue;
> +                     }
> +                     if (ItemIdIsRedirected(curr_lp))
> +                     {
> +                             next_lp = PageGetItemId(ctx.page, nextoffnum);
> +                             if (ItemIdIsRedirected(next_lp))
> +                             {
> +                                     report_corruption(&ctx,
> +                                                                       psprintf("redirected line pointer pointing to another redirected line pointer at offset %u",
> +                                                                                        (unsigned) nextoffnum));
> +                                     continue;
> +                             }
> +                             next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
> +                             if (!HeapTupleHeaderIsHeapOnly(next_htup))
> +                             {
> +                                     report_corruption(&ctx,
> +                                                                       psprintf("redirected tuple at line pointer offset %u is not heap only tuple",
> +                                                                                        (unsigned) nextoffnum));
> +                             }
> +                             if ((next_htup->t_infomask & HEAP_UPDATED) == 0)
> +                             {
> +                                     report_corruption(&ctx,
> +                                                                       psprintf("redirected tuple at line pointer offset %u is not heap updated tuple",
> +                                                                                        (unsigned) nextoffnum));
> +                             }
> +                             continue;
> +                     }
> +
> +                     /*
> +                      * Add a line pointer offset to the predecessor array if xmax is
> +                      * matching with xmin of next tuple (reaching via its t_ctid).
> +                      * Prior to PostgreSQL 9.4, we actually changed the xmin to
> +                      * FrozenTransactionId

I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
of getting that right seems low and I don't see us gaining much by even trying.


> so we must add offset to predecessor
> +                      * array(irrespective of xmax-xmin matching) if updated tuple xmin
> +                      * is frozen, so that we can later do validation related to frozen
> +                      * xmin. Raise corruption if we have two tuples having the same
> +                      * predecessor.
> +                      * We add the offset to the predecessor array irrespective of the
> +                      * transaction (t_xmin) status. We will do validation related to
> +                      * the transaction status (and also all other validations) when we
> +                      * loop over the predecessor array.
> +                      */
> +                     curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
> +                     curr_xmax = HeapTupleHeaderGetUpdateXid(curr_htup);
> +                     next_lp = PageGetItemId(ctx.page, nextoffnum);
> +                     next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
> +                     next_xmin = HeapTupleHeaderGetXmin(next_htup);
> +                     if (TransactionIdIsValid(curr_xmax) &&
> +                             (TransactionIdEquals(curr_xmax, next_xmin) ||
> +                              next_xmin == FrozenTransactionId))
> +                     {
> +                             if (predecessor[nextoffnum] != 0)
> +                             {
> +                                     report_corruption(&ctx,
> +                                                                       psprintf("updated version at offset %u is also the updated version of tuple at offset %u",
> +                                                                                        (unsigned) nextoffnum, (unsigned) predecessor[nextoffnum]));
> +                                     continue;

I doubt it is correct to enter this path with next_xmin ==
FrozenTransactionId. This is following a ctid chain that we normally wouldn't
follow, because it doesn't satisfy the t_self->xmax == t_ctid->xmin condition.

I don't immediately see what prevents the frozen tuple being from an entirely
different HOT chain than the two tuples pointing to it.



Prior to 9.4 we could have xmin overwritten with FrozenTransactionId, but in 9.4 and later we set the XMIN_FROZEN bit in t_infomask. If the tuple was frozen prior to 9.4, then "TransactionIdEquals(curr_xmax, next_xmin)" will be false for the frozen tuple.
The intention of adding "next_xmin == FrozenTransactionId" to the path is that we wanted to do validation around frozen tuples when we loop over the predecessor array.

I need to look at the isolation test in detail to understand how this can produce a false alarm, but if there is a valid case then we can remove the logic of raising corruption related to frozen tuples.


> +             }
> +
> +             /* Loop over offsets and validate the data in the predecessor array. */
> +             for (OffsetNumber currentoffnum = FirstOffsetNumber; currentoffnum <= maxoff;
> +                      currentoffnum = OffsetNumberNext(currentoffnum))
> +             {
> +                     HeapTupleHeader pred_htup;
> +                     HeapTupleHeader curr_htup;
> +                     TransactionId pred_xmin;
> +                     TransactionId curr_xmin;
> +                     ItemId          pred_lp;
> +                     ItemId          curr_lp;
> +
> +                     ctx.offnum = predecessor[currentoffnum];
> +                     ctx.attnum = -1;
> +
> +                     if (ctx.offnum == 0)
> +                     {
> +                             /*
> +                              * Either the root of the chain or an xmin-aborted tuple from
> +                              * an abandoned portion of the HOT chain.
> +                              */

Hm - couldn't we check that the tuple could conceivably be at the root of a
chain? I.e. isn't HEAP_HOT_UPDATED? Or alternatively has an aborted xmin?

 
 I don't see a way to check whether a tuple is at the root of a HOT chain, because the predecessor array will always have either an xmin from a non-abandoned transaction or zero. We can't differentiate the root from a tuple inserted via an abandoned transaction.


> +                             continue;
> +                     }
> +
> +                     curr_lp = PageGetItemId(ctx.page, currentoffnum);
> +                     curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
> +                     curr_xmin = HeapTupleHeaderGetXmin(curr_htup);
> +
> +                     ctx.itemid = pred_lp = PageGetItemId(ctx.page, ctx.offnum);
> +                     pred_htup = (HeapTupleHeader) PageGetItem(ctx.page, pred_lp);
> +                     pred_xmin = HeapTupleHeaderGetXmin(pred_htup);
> +
> +                     /*
> +                      * If the predecessor's xmin is aborted or in progress, the
> +                      * current tuples xmin should be aborted or in progress
> +                      * respectively. Also both xmin's must be equal.
> +                      */
> +                     if (!TransactionIdEquals(pred_xmin, curr_xmin) &&
> +                             !TransactionIdDidCommit(pred_xmin))
> +                     {
> +                             report_corruption(&ctx,
> +                                                               psprintf("tuple with uncommitted xmin %u was updated to produce a tuple at offset %u with differing xmin %u",
> +                                                                                (unsigned) pred_xmin, (unsigned) currentoffnum, (unsigned) curr_xmin));

Is this necessarily true? What about a tuple that was inserted in a
subtransaction and then updated in another subtransaction of the same toplevel
transaction?


Not sure I am following? I have tried the test below and don't see any issue:

‘postgres[14723]=#’drop table test2;
DROP TABLE
‘postgres[14723]=#’create table test2 (a int, b int primary key);
CREATE TABLE
‘postgres[14723]=#’insert into test2 values (1,1);
INSERT 0 1
‘postgres[14723]=#’BEGIN;
BEGIN
‘postgres[14723]=#*’update test2 set a =2 where a =1;
UPDATE 1
‘postgres[14723]=#*’savepoint s1;
SAVEPOINT
‘postgres[14723]=#*’update test2 set a =6;
UPDATE 1
‘postgres[14723]=#*’rollback to savepoint s1;
ROLLBACK
‘postgres[14723]=#*’update test2 set a =6;
UPDATE 1
‘postgres[14723]=#*’savepoint s2;
SAVEPOINT
‘postgres[14723]=#*’update test2 set a =7;
UPDATE 1
‘postgres[14723]=#*’end;
COMMIT
‘postgres[14723]=#’SELECT lp as tuple, t_xmin, t_xmax, t_field3 as t_cid, t_ctid,tuple_data_split('test2'::regclass, t_data, t_infomask, t_infomask2, t_bits), heap_tuple_infomask_flags(t_infomask, t_infomask2) FROM heap_page_items(get_raw_page('test2', 0));
 tuple | t_xmin | t_xmax | t_cid | t_ctid |       tuple_data_split        |                         heap_tuple_infomask_flags                        
-------+--------+--------+-------+--------+-------------------------------+---------------------------------------------------------------------------
     1 |   1254 |   1255 |     0 | (0,2)  | {"\\x01000000","\\x01000000"} | ("{HEAP_XMIN_COMMITTED,HEAP_HOT_UPDATED}",{})
     2 |   1255 |   1257 |     1 | (0,4)  | {"\\x02000000","\\x01000000"} | ("{HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}",{})
     3 |   1256 |      0 |     1 | (0,3)  | {"\\x06000000","\\x01000000"} | ("{HEAP_XMIN_INVALID,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}",{})
     4 |   1257 |   1258 |     2 | (0,5)  | {"\\x06000000","\\x01000000"} | ("{HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}",{})
     5 |   1258 |      0 |     3 | (0,5)  | {"\\x07000000","\\x01000000"} | ("{HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}",{})
(5 rows)
 

> +                     }
> +
> +                     /*
> +                      * If the predecessor's xmin is not frozen, then current tuple's
> +                      * shouldn't be either.
> +                      */
> +                     if (pred_xmin != FrozenTransactionId && curr_xmin == FrozenTransactionId)
> +                     {
> +                             report_corruption(&ctx,
> +                                                               psprintf("unfrozen tuple was updated to produce a tuple at offset %u which is frozen",
> +                                                                                (unsigned) currentoffnum));
> +                     }

Can't we have an update chain that is e.g.
xmin 10, xmax 5 -> xmin 5, xmax invalid

and a vacuum cutoff of 7? That'd prevent the first tuple from being removed,
but would allow 5 to be frozen.

I think there were recent patches proposing we don't freeze in that case, but
we will have done that in the past....


Not very sure about this; I was trying to construct such a case but found it hard to reproduce.


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Wed, Nov 16, 2022 at 1:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 15, 2022 at 2:50 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-15 11:36:21 -0500, Robert Haas wrote:
> > On Mon, Nov 14, 2022 at 5:02 PM Andres Freund <andres@anarazel.de> wrote:
> > > It seems like we should do a bit more validation within a chain of
> > > tuples. E.g. that no live tuple can follow an !DidCommit xmin?
> >
> > I think this check is already present in stronger form. If we see a
> > !DidCommit xmin, the xmin of the next tuple in the chain not only can't be
> > committed, but had better be the same.
>
> As I think I mentioned before, I don't think the "better be the same" aspect
> is correct, think subxacts. E.g.
>
> off 0: xmin: top, xmax: child_1
> off 1: xmin: child_1, xmax: invalid
>
> If top hasn't committed yet, the current logic afaict will warn about this
> situation, no? And I don't think we can generally determine the subxid parent
> at this point, unfortunately (might have truncated subtrans).

Woops, you're right.

Yes, got it. I have tried to test this, and it reports false corruption in the subtransaction case.
I think a better way to do this check is to verify that if pred_xmin is
aborted, then current_xmin must also be aborted. So there is no way for us to
validate corruption involving an in-progress txid.


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Wed, Nov 16, 2022 at 4:51 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Yes, got it. I have tried to test this, and it reports false corruption in the subtransaction case.
> I think a better way to do this check is to verify that if pred_xmin is
> aborted, then current_xmin must also be aborted. So there is no way for us to
> validate corruption involving an in-progress txid.

Please note that you can't use TransactionIdDidAbort here, because
that will return false for transactions aborted by a crash. You have
to check that it's not in progress and then afterwards check that it's
not committed. Also note that if you check whether it's committed
first and then check whether it's in progress afterwards, there's a
race condition: it might commit just after you verify that it isn't
committed yet, and then it won't be in progress any more and will look
aborted.

I disagree with the idea that we can't check in progress. I think the
checks could look something like this:

pred_in_progress = TransactionIdIsInProgress(pred_xmin);
current_in_progress = TransactionIdIsInProgress(current_xmin);
if (pred_in_progress)
{
     if (current_in_progress)
        return ok;
     // recheck to avoid race condition
     if (TransactionIdIsInProgress(pred_xmin))
     {
         if (TransactionIdDidCommit(current_xmin))
             return corruption: predecessor xmin in progress, but current xmin committed;
         else
             return corruption: predecessor xmin in progress, but current xmin aborted;
     }
     // fallthrough: when we entered this if-block pred_xmin was still
     // in progress, but it no longer is
     pred_in_progress = false;
}

if (TransactionIdDidCommit(pred_xmin))
    return ok;

if (current_in_progress)
     return corruption: predecessor xmin aborted, but current xmin in progress;
else if (TransactionIdDidCommit(current_xmin))
     return corruption: predecessor xmin aborted, but current xmin committed;

The error messages as phrased here aren't actually what we should use;
they would need rephrasing. But I think, or hope anyway, that the
logic works. I think you basically just need the 1 recheck: if you see
the predecessor xmin in progress and then the current xmin in
progress, you have to go back and check that the predecessor xmin is
still in progress, because otherwise both could have committed or
aborted together in between.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Wed, Nov 16, 2022 at 11:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Nov 16, 2022 at 4:51 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Yes, got it. I have tried to test this, and it reports false corruption in the subtransaction case.
> I think a better way to do this check is to verify that if pred_xmin is
> aborted, then current_xmin must also be aborted. So there is no way for us to
> validate corruption involving an in-progress txid.

Please note that you can't use TransactionIdDidAbort here, because
that will return false for transactions aborted by a crash. You have
to check that it's not in progress and then afterwards check that it's
not committed. Also note that if you check whether it's committed
first and then check whether it's in progress afterwards, there's a
race condition: it might commit just after you verify that it isn't
committed yet, and then it won't be in progress any more and will look
aborted.

I disagree with the idea that we can't check in progress. I think the
checks could look something like this:

pred_in_progress = TransactionIdIsInProgress(pred_xmin);
current_in_progress = TransactionIdIsInProgress(current_xmin);
if (pred_in_progress)
{
     if (current_in_progress)
        return ok;
     // recheck to avoid race condition
     if (TransactionIdIsInProgress(pred_xmin))
     {
         if (TransactionIdDidCommit(current_xmin))
             return corruption: predecessor xmin in progress, but current xmin committed;
         else
             return corruption: predecessor xmin in progress, but current xmin aborted;
     }
I think we can have a situation where pred_xmin is in progress but curr_xmin is aborted; consider the example below:
 ‘postgres[14723]=#’BEGIN;
BEGIN
‘postgres[14723]=#*’insert into test2 values (1,1);
INSERT 0 1
‘postgres[14723]=#*’savepoint s1;
SAVEPOINT
‘postgres[14723]=#*’update test2 set a =2;
UPDATE 1
‘postgres[14723]=#*’rollback to savepoint s1;
ROLLBACK

Now pred_xmin is in progress but curr_xmin is aborted; am I missing anything here?
I think if pred_xmin is aborted and curr_xmin is in progress we should consider it a corruption case, but vice versa is not true.


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Wed, Nov 16, 2022 at 10:57 PM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> I think if pred_xmin is aborted and curr_xmin is in progress we should consider it a corruption case, but vice
> versa is not true.
 

Yeah, you're right. I'm being stupid about subtransactions again.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Wed, Nov 16, 2022 at 12:41 PM Himanshu Upadhyaya <upadhyaya.himanshu@gmail.com> wrote:


> +             }
> +
> +             /* Loop over offsets and validate the data in the predecessor array. */
> +             for (OffsetNumber currentoffnum = FirstOffsetNumber; currentoffnum <= maxoff;
> +                      currentoffnum = OffsetNumberNext(currentoffnum))
> +             {
> +                     HeapTupleHeader pred_htup;
> +                     HeapTupleHeader curr_htup;
> +                     TransactionId pred_xmin;
> +                     TransactionId curr_xmin;
> +                     ItemId          pred_lp;
> +                     ItemId          curr_lp;
> +
> +                     ctx.offnum = predecessor[currentoffnum];
> +                     ctx.attnum = -1;
> +
> +                     if (ctx.offnum == 0)
> +                     {
> +                             /*
> +                              * Either the root of the chain or an xmin-aborted tuple from
> +                              * an abandoned portion of the HOT chain.
> +                              */

Hm - couldn't we check that the tuple could conceivably be at the root of a
chain? I.e. isn't HEAP_HOT_UPDATED? Or alternatively has an aborted xmin?

 
 I don't see a way to check whether a tuple is at the root of a HOT chain, because the predecessor array will always have either an xmin from a non-abandoned transaction or zero. We can't differentiate the root from a tuple inserted via an abandoned transaction.

I was wrong here. I think this can be done, and I will make these changes in my next patch.


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Tue, Nov 15, 2022 at 3:32 AM Andres Freund <andres@anarazel.de> wrote:

> Furthermore, it is
> possible that successor[x] = successor[x'] since the page might be corrupted
> and we haven't checked otherwise.
>
> predecessor[y] = x means that successor[x] = y but in addition we've
> checked that y is sane, and that x.xmax=y.xmin. If there are multiple
> tuples for which these conditions hold, we've issued complaints about
> all but one and entered the last into the predecessor array.

As shown by the isolationtester test I just posted, this doesn't quite work
right now. Probably fixable.

I don't think we can follow non-HOT ctid chains if they're older than the xmin
horizon, including all cases of xmin being frozen. There's just nothing
guaranteeing that the tuples are actually "related".

I understand the problem with frozen tuples but don't understand the concern with non-HOT chains;
could you please explain it a bit more?


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-17 21:33:17 +0530, Himanshu Upadhyaya wrote:
> On Tue, Nov 15, 2022 at 3:32 AM Andres Freund <andres@anarazel.de> wrote:
> > > Furthermore, it is
> > > possible that successor[x] = successor[x'] since the page might be
> > corrupted
> > > and we haven't checked otherwise.
> > >
> > > predecessor[y] = x means that successor[x] = y but in addition we've
> > > checked that y is sane, and that x.xmax=y.xmin. If there are multiple
> > > tuples for which these conditions hold, we've issued complaints about
> > > all but one and entered the last into the predecessor array.
> >
> > As shown by the isolationtester test I just posted, this doesn't quite work
> > right now. Probably fixable.
> >
> > I don't think we can follow non-HOT ctid chains if they're older than the
> > xmin
> > horizon, including all cases of xmin being frozen. There's just nothing
> > guaranteeing that the tuples are actually "related".
> >
> I understand the problem with frozen tuples but don't understand the
> concern with non-HOT chains;
> could you please explain it a bit more?

I think there might be cases where following non-HOT ctid-chains across tuples
within a page will trigger spurious errors, if the tuple versions are older
than the xmin horizon. But it's a bit hard to say without seeing the code with
a bunch of the other bugs fixed.

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Tue, Nov 15, 2022 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> I'm quite certain that it's possible to end up freezing an earlier row
> version in a hot chain in < 14; I got there with careful gdb
> orchestration. Of course it's possible I screwed something up, given I did it
> once, interactively. Not sure if trying to fix it is worth the risk of
> backpatching all the necessary changes to switch to the retry approach.

There is code in heap_prepare_freeze_tuple() that treats a raw xmax as
"xmax_already_frozen = true", even when the raw xmax value isn't
already set to InvalidTransactionId. I'm referring to this code:

    if ( ... ) // process raw xmax
    ....
    else if (TransactionIdIsNormal(xid))
    ....
    else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
             !TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
    {
        freeze_xmax = false;
        xmax_already_frozen = true;
        /* No need for relfrozenxid_out handling for already-frozen xmax */
    }
    else
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found xmax %u (infomask 0x%04x) not
frozen, not multi, not normal",
                                 xid, tuple->t_infomask)));

Why should it be okay to not process xmax during this call (by setting
"xmax_already_frozen = true"), just because HEAP_XMAX_INVALID happens
to be set? Isn't HEAP_XMAX_INVALID purely a hint? (HEAP_XMIN_FROZEN is
*not* a hint, but we're dealing with xmax here.)

I'm not sure how relevant this is to the concerns you have about
frozen xmax, or even if it's any kind of problem, but it still seems
worth fixing. It seems to me that there should be clear rules on what
special transaction IDs can appear in xmax. Namely: the only special
transaction ID that can ever appear in xmax is InvalidTransactionId.
(Also, it's not okay to see *any* other XID in the
"xmax_already_frozen = true" path, nor would it be okay to leave any
other XID behind in xmax in the nearby "freeze_xmax = true" path.)

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-20 11:58:12 -0800, Peter Geoghegan wrote:
> There is code in heap_prepare_freeze_tuple() that treats a raw xmax as
> "xmax_already_frozen = true", even when the raw xmax value isn't
> already set to InvalidTransactionId. I'm referring to this code:
>
>     if ( ... ) // process raw xmax
>     ....
>     else if (TransactionIdIsNormal(xid))
>     ....
>     else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
>              !TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
>     {
>         freeze_xmax = false;
>         xmax_already_frozen = true;
>         /* No need for relfrozenxid_out handling for already-frozen xmax */
>     }
>     else
>         ereport(ERROR,
>                 (errcode(ERRCODE_DATA_CORRUPTED),
>                  errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
>                                  xid, tuple->t_infomask)));
>
> Why should it be okay to not process xmax during this call (by setting
> "xmax_already_frozen = true"), just because HEAP_XMAX_INVALID happens
> to be set? Isn't HEAP_XMAX_INVALID purely a hint? (HEAP_XMIN_FROZEN is
> *not* a hint, but we're dealing with xmax here.)

Hm. But to get to that point we already need to have decided that xmax
is not a normal xid. Unhelpfully we reuse the 'xid' variable for xmax as
well:
    xid = HeapTupleHeaderGetRawXmax(tuple);

I don't really know what the HEAP_XMAX_INVALID branch is trying to do. For
one, xid is already set to HeapTupleHeaderGetRawXmax(), so why is it
refetching the value?

So it looks to me like this path should just test !TransactionIdIsValid(xid)?
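
I.e. something like (sketch):

    else if (!TransactionIdIsValid(xid))
    {
        /* raw xmax is already invalid, i.e. nothing to freeze */
        freeze_xmax = false;
        xmax_already_frozen = true;
    }
    else
        ereport(ERROR, ...);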


> It seems to me that there should be clear rules on what
> special transaction IDs can appear in xmax. Namely: the only special
> transaction ID that can ever appear in xmax is InvalidTransactionId.
> (Also, it's not okay to see *any* other XID in the
> "xmax_already_frozen = true" path, nor would it be okay to leave any
> other XID behind in xmax in the nearby "freeze_xmax = true" path.)

Yea.

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Mon, Nov 21, 2022 at 1:34 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. But to get to that point we already need to have decided that xmax
> is not a normal xid. Unhelpfully we reuse the 'xid' variable for xmax as
> well:
>         xid = HeapTupleHeaderGetRawXmax(tuple);
>
> I don't really know what the HEAP_XMAX_INVALID branch is trying to do. For
> one, xid is already set to HeapTupleHeaderGetRawXmax(), so why is it
> refetching the value?

Right, that detail is correct, but still weird. And suggests that it
might not have been super well thought through.

> So it looks to me like this path should just test !TransactionIdIsValid(xid)?

Agreed. Plus there should be a comment that reminds you that this is a
regular transaction ID (easy to miss, because the initial "if"
block for Multis is rather large).

I will push something like that soon.

-- 
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Thu, Nov 10, 2022 at 3:38 AM Andres Freund <andres@anarazel.de> wrote:

> +             }
> +
> +             /*
> +              * Loop over offset and populate predecessor array from all entries
> +              * that are present in successor array.
> +              */
> +             ctx.attnum = -1;
> +             for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
> +                      ctx.offnum = OffsetNumberNext(ctx.offnum))
> +             {
> +                     ItemId          curr_lp;
> +                     ItemId          next_lp;
> +                     HeapTupleHeader curr_htup;
> +                     HeapTupleHeader next_htup;
> +                     TransactionId curr_xmax;
> +                     TransactionId next_xmin;
> +
> +                     OffsetNumber nextoffnum = successor[ctx.offnum];
> +
> +                     curr_lp = PageGetItemId(ctx.page, ctx.offnum);

Why do we get the item when nextoffnum is 0?

Fixed by moving the PageGetItemId() call after the 'if' check.


> +                     if (nextoffnum == 0 || !lp_valid[ctx.offnum] || !lp_valid[nextoffnum])
> +                     {
> +                             /*
> +                              * This is either the last updated tuple in the chain or a
> +                              * corruption raised for this tuple.
> +                              */

"or a corruption raised" isn't quite right grammatically.
 
done.

> +                             continue;
> +                     }
> +                     if (ItemIdIsRedirected(curr_lp))
> +                     {
> +                             next_lp = PageGetItemId(ctx.page, nextoffnum);
> +                             if (ItemIdIsRedirected(next_lp))
> +                             {
> +                                     report_corruption(&ctx,
> +                                                                       psprintf("redirected line pointer pointing to another redirected line pointer at offset %u",
> +                                                                                        (unsigned) nextoffnum));
> +                                     continue;
> +                             }
> +                             next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
> +                             if (!HeapTupleHeaderIsHeapOnly(next_htup))
> +                             {
> +                                     report_corruption(&ctx,
> +                                                                       psprintf("redirected tuple at line pointer offset %u is not heap only tuple",
> +                                                                                        (unsigned) nextoffnum));
> +                             }
> +                             if ((next_htup->t_infomask & HEAP_UPDATED) == 0)
> +                             {
> +                                     report_corruption(&ctx,
> +                                                                       psprintf("redirected tuple at line pointer offset %u is not heap updated tuple",
> +                                                                                        (unsigned) nextoffnum));
> +                             }
> +                             continue;
> +                     }
> +
> +                     /*
> +                      * Add a line pointer offset to the predecessor array if xmax is
> +                      * matching with xmin of next tuple (reaching via its t_ctid).
> +                      * Prior to PostgreSQL 9.4, we actually changed the xmin to
> +                      * FrozenTransactionId

I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
of getting that right seems low and I don't see us gaining much by even trying.



Removed the code related to the frozen-tuple checks.

> so we must add offset to predecessor
> +                      * array(irrespective of xmax-xmin matching) if updated tuple xmin
> +                      * is frozen, so that we can later do validation related to frozen
> +                      * xmin. Raise corruption if we have two tuples having the same
> +                      * predecessor.
> +                      * We add the offset to the predecessor array irrespective of the
> +                      * transaction (t_xmin) status. We will do validation related to
> +                      * the transaction status (and also all other validations) when we
> +                      * loop over the predecessor array.
> +                      */
> +                     curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
> +                     curr_xmax = HeapTupleHeaderGetUpdateXid(curr_htup);
> +                     next_lp = PageGetItemId(ctx.page, nextoffnum);
> +                     next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
> +                     next_xmin = HeapTupleHeaderGetXmin(next_htup);
> +                     if (TransactionIdIsValid(curr_xmax) &&
> +                             (TransactionIdEquals(curr_xmax, next_xmin) ||
> +                              next_xmin == FrozenTransactionId))
> +                     {
> +                             if (predecessor[nextoffnum] != 0)
> +                             {
> +                                     report_corruption(&ctx,
> +                                                                       psprintf("updated version at offset %u is also the updated version of tuple at offset %u",
> +                                                                                        (unsigned) nextoffnum, (unsigned) predecessor[nextoffnum]));
> +                                     continue;

I doubt it is correct to enter this path with next_xmin ==
FrozenTransactionId. This is following a ctid chain that we normally wouldn't
follow, because it doesn't satisfy the t_self->xmax == t_ctid->xmin condition.

removed this frozen check.

> +             }
> +
> +             /* Loop over offsets and validate the data in the predecessor array. */
> +             for (OffsetNumber currentoffnum = FirstOffsetNumber; currentoffnum <= maxoff;
> +                      currentoffnum = OffsetNumberNext(currentoffnum))
> +             {
> +                     HeapTupleHeader pred_htup;
> +                     HeapTupleHeader curr_htup;
> +                     TransactionId pred_xmin;
> +                     TransactionId curr_xmin;
> +                     ItemId          pred_lp;
> +                     ItemId          curr_lp;
> +
> +                     ctx.offnum = predecessor[currentoffnum];
> +                     ctx.attnum = -1;
> +
> +                     if (ctx.offnum == 0)
> +                     {
> +                             /*
> +                              * Either the root of the chain or an xmin-aborted tuple from
> +                              * an abandoned portion of the HOT chain.
> +                              */

Hm - couldn't we check that the tuple could conceivably be at the root of a
chain? I.e. isn't HEAP_HOT_UPDATED? Or alternatively has an aborted xmin?

Done. I have added code to identify cases where an offset is missing from the predecessor[] array, and added validation that the root of the chain must not be HEAP_ONLY_TUPLE.
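
The shape of that check is roughly the following -- a sketch, not the exact
patch text; per the earlier discussion it also has to allow for heap-only
tuples from abandoned (aborted-xmin) portions of a chain:

    /* offnum never appeared as anyone's successor => candidate chain root */
    if (predecessor[offnum] == InvalidOffsetNumber &&
        HeapTupleHeaderIsHeapOnly(htup) &&
        (TransactionIdIsInProgress(xmin) || TransactionIdDidCommit(xmin)))
        report_corruption(&ctx,
                          pstrdup("tuple is root of chain but is marked as heap-only tuple"));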

> +                             continue;
> +                     }
> +
> +                     curr_lp = PageGetItemId(ctx.page, currentoffnum);
> +                     curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
> +                     curr_xmin = HeapTupleHeaderGetXmin(curr_htup);
> +
> +                     ctx.itemid = pred_lp = PageGetItemId(ctx.page, ctx.offnum);
> +                     pred_htup = (HeapTupleHeader) PageGetItem(ctx.page, pred_lp);
> +                     pred_xmin = HeapTupleHeaderGetXmin(pred_htup);
> +
> +                     /*
> +                      * If the predecessor's xmin is aborted or in progress, the
> +                      * current tuples xmin should be aborted or in progress
> +                      * respectively. Also both xmin's must be equal.
> +                      */
> +                     if (!TransactionIdEquals(pred_xmin, curr_xmin) &&
> +                             !TransactionIdDidCommit(pred_xmin))
> +                     {
> +                             report_corruption(&ctx,
> +                                                               psprintf("tuple with uncommitted xmin %u was updated to produce a tuple at offset %u with differing xmin %u",
> +                                                                                (unsigned) pred_xmin, (unsigned) currentoffnum, (unsigned) curr_xmin));

Is this necessarily true? What about a tuple that was inserted in a
subtransaction and then updated in another subtransaction of the same toplevel
transaction?

 
The patch has been updated to handle cases of sub-transactions.


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2022-11-30 16:09:19 +0530, Himanshu Upadhyaya wrote:
> has been updated to handle cases of sub-transaction.

Thanks!


> +        /* Loop over offsets and validate the data in the predecessor array. */
> +        for (OffsetNumber currentoffnum = FirstOffsetNumber; currentoffnum <= maxoff;
> +             currentoffnum = OffsetNumberNext(currentoffnum))
> +        {
> +            HeapTupleHeader pred_htup;
> +            HeapTupleHeader curr_htup;
> +            TransactionId pred_xmin;
> +            TransactionId curr_xmin;
> +            ItemId        pred_lp;
> +            ItemId        curr_lp;
> +            bool        pred_in_progress;
> +            XidCommitStatus xid_commit_status;
> +            XidBoundsViolation xid_status;
> +
> +            ctx.offnum = predecessor[currentoffnum];
> +            ctx.attnum = -1;
> +            curr_lp = PageGetItemId(ctx.page, currentoffnum);
> +            if (!lp_valid[currentoffnum] || ItemIdIsRedirected(curr_lp))
> +                continue;

I don't think we should do PageGetItemId(ctx.page, currentoffnum); if !lp_valid[currentoffnum].


> +            curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
> +            curr_xmin = HeapTupleHeaderGetXmin(curr_htup);
> +            xid_status = get_xid_status(curr_xmin, &ctx, &xid_commit_status);
> +            if (!(xid_status == XID_BOUNDS_OK || xid_status == XID_INVALID))
> +                continue;

Why can we even get here if the xid status isn't XID_BOUNDS_OK?


> +            if (ctx.offnum == 0)

For one, I think it'd be better to use InvalidOffsetNumber here. But more
generally, storing the predecessor in ctx.offnum seems quite confusing.


> +            {
> +                /*
> +                 * No harm in overriding value of ctx.offnum as we will always
> +                 * continue if we are here.
> +                 */
> +                ctx.offnum = currentoffnum;
> +                if (TransactionIdIsInProgress(curr_xmin) || TransactionIdDidCommit(curr_xmin))

Is it actually ok to call TransactionIdDidCommit() here? There's a reason
get_xid_status() exists after all. And we do have the xid status for xmin
already, so this could just check xid_commit_status, no?



Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:
Hi,


On Fri, Dec 2, 2022 at 5:43 AM Andres Freund <andres@anarazel.de> wrote:

> +                     curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
> +                     curr_xmin = HeapTupleHeaderGetXmin(curr_htup);
> +                     xid_status = get_xid_status(curr_xmin, &ctx, &xid_commit_status);
> +                     if (!(xid_status == XID_BOUNDS_OK || xid_status == XID_INVALID))
> +                             continue;

Why can we even get here if the xid status isn't XID_BOUNDS_OK?



 @@ -504,9 +516,269 @@ verify_heapam(PG_FUNCTION_ARGS)
                        /* It should be safe to examine the tuple's header, at least */
                        ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
                        ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
+                       lp_valid[ctx.offnum] = true;
 
                        /* Ok, ready to check this next tuple */
                        check_tuple(&ctx);

Referring to the code above: check_tuple(&ctx) does perform this check, but we populate lp_valid before that call.
Populating lp_valid before check_tuple() is intentional. Even if we changed check_tuple() to return a status and used that to populate lp_valid, it would be hard to validate the cases that depend on an aborted transaction (like "tuple with aborted xmin %u was updated to produce a tuple at offset %u with committed xmin %u"), because check_tuple_visibility() also looks for an aborted xmin and returns false if the tuple's xmin is aborted. We could add one more parameter to check_tuple() to report whether the transaction is aborted and set lp_valid accordingly, but that would add unnecessary complexity, and I don't find it a convincing implementation. Rechecking xid_status instead is simpler and more direct.
 

> +                     if (ctx.offnum == 0)

For one, I think it'd be better to use InvalidOffsetNumber here. But more
generally, storing the predecessor in ctx.offnum seems quite confusing.

OK, I will change it to InvalidOffsetNumber in all the places. We need ctx.offnum to hold the value from the predecessor array, as it is used internally by the report_corruption function to generate the message (e.g. below), and the format of these messages seems simpler and more meaningful for reporting corruption.

  report_corruption(&ctx,
                    psprintf("heap-only update produced a non-heap only tuple at offset %u",
                             (unsigned) currentoffnum));
We don't need to mention ctx.offnum explicitly in the above message, as that is taken care of by the code below.

"report_corruption_internal(Tuplestorestate *tupstore, TupleDesc tupdesc,
                                                   BlockNumber blkno, OffsetNumber offnum,
                                                   AttrNumber attnum, char *msg)
{
        Datum           values[HEAPCHECK_RELATION_COLS] = {0};
        bool            nulls[HEAPCHECK_RELATION_COLS] = {0};
        HeapTuple       tuple;

        values[0] = Int64GetDatum(blkno);
        values[1] = Int32GetDatum(offnum);"
 
--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Fri, Dec 2, 2022 at 5:43 AM Andres Freund <andres@anarazel.de> wrote:

> +             /* Loop over offsets and validate the data in the predecessor array. */
> +             for (OffsetNumber currentoffnum = FirstOffsetNumber; currentoffnum <= maxoff;
> +                      currentoffnum = OffsetNumberNext(currentoffnum))
> +             {
> +                     HeapTupleHeader pred_htup;
> +                     HeapTupleHeader curr_htup;
> +                     TransactionId pred_xmin;
> +                     TransactionId curr_xmin;
> +                     ItemId          pred_lp;
> +                     ItemId          curr_lp;
> +                     bool            pred_in_progress;
> +                     XidCommitStatus xid_commit_status;
> +                     XidBoundsViolation xid_status;
> +
> +                     ctx.offnum = predecessor[currentoffnum];
> +                     ctx.attnum = -1;
> +                     curr_lp = PageGetItemId(ctx.page, currentoffnum);
> +                     if (!lp_valid[currentoffnum] || ItemIdIsRedirected(curr_lp))
> +                             continue;

I don't think we should do PageGetItemId(ctx.page, currentoffnum); if !lp_valid[currentoffnum].

Fixed.

> +                     if (ctx.offnum == 0)

For one, I think it'd be better to use InvalidOffsetNumber here. But more
generally, storing the predecessor in ctx.offnum seems quite confusing.

Changed all relevant places to use InvalidOffsetNumber.

> +                     {
> +                             /*
> +                              * No harm in overriding value of ctx.offnum as we will always
> +                              * continue if we are here.
> +                              */
> +                             ctx.offnum = currentoffnum;
> +                             if (TransactionIdIsInProgress(curr_xmin) || TransactionIdDidCommit(curr_xmin))

Is it actually ok to call TransactionIdDidCommit() here? There's a reason
get_xid_status() exists after all. And we do have the xid status for xmin
already, so this could just check xid_commit_status, no?


I think it would be good to pass NULL to get_xid_status, as in "get_xid_status(curr_xmin, &ctx, NULL);", so that we check the xid status only when it is actually required. This way we can avoid checking the xid status in cases where we simply 'continue' due to some earlier check.
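
For illustration, a sketch of this lazy-lookup idea (get_xid_status() and its argument order are taken from the patch quoted above; the surrounding loop is assumed):

    /* Bounds check only; passing NULL skips the commit-status (CLOG) lookup. */
    xid_status = get_xid_status(curr_xmin, &ctx, NULL);
    if (xid_status != XID_BOUNDS_OK)
        continue;

    /* On the paths that actually need the status, ask for it explicitly. */
    xid_status = get_xid_status(curr_xmin, &ctx, &xid_commit_status);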


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi hackers,

> Fixed.

I noticed that this patch stuck a little and decided to take another look.

It seems to be well written, covered with tests and my understanding
is that all the previous feedback was accounted for. To your knowledge
is there anything that prevents us from moving it to "Ready for
Committer"?

-- 
Best regards,
Aleksander Alekseev



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Thu, Jan 19, 2023 at 8:55 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> I noticed that this patch stuck a little and decided to take another look.
>
> It seems to be well written, covered with tests and my understanding
> is that all the previous feedback was accounted for. To your knowledge
> is there anything that prevents us from moving it to "Ready for
> Committer"?

Thanks for taking a look, and for pinging the thread.

I think that the handling of lp_valid[] in the loop that begins with
"Loop over offset and populate predecessor array from all entries that
are present in successor array" is very confusing. I think that
lp_valid[] should be answering the question "is the line pointer
basically sane?". That is, if it's a redirect, it needs to point to
something within the line pointer array (and we also check that it
must be an entry in the line pointer array that is used, which seems
fine). If it's not a redirect, it needs to point to space that's
entirely within the block, properly aligned, and big enough to contain
a tuple. We determine the answers to all of these questions in the
first loop, the one that starts with /* Perform tuple checks */.

Nothing that happens in the second loop, where we populate the
predecessor array, can reverse our previous conclusion that the line
pointer is valid, so this loop shouldn't be resetting entries in
lp_valid[] to false. The reason that it's doing so seems to be that it
wants to use lp_valid[] to control the behavior of the third loop,
where we perform checks against things that have entries in the
predecessor array. As written, the code ensures that we always set
lp_valid[nextoffnum] to false unless we set predecessor[nextoffnum] to
a value other than InvalidOffsetNumber. But that is needlessly
complex: the third loop doesn't need to look at lp_valid[] at all. It
can just check whether predecessor[currentoffnum] is valid. If it is,
perform checks. Otherwise, skip it. It seems to me that this would be
significantly simpler.
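
For illustration, a minimal sketch of the simplified third loop being described here (names taken from the patch under review, not actual committed code):

    for (OffsetNumber currentoffnum = FirstOffsetNumber; currentoffnum <= maxoff;
         currentoffnum = OffsetNumberNext(currentoffnum))
    {
        /* Skip entries that never got a predecessor; no lp_valid[] test needed. */
        if (predecessor[currentoffnum] == InvalidOffsetNumber)
            continue;

        /* ... perform the predecessor/current-tuple cross-checks here ... */
    }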

To put the above complaint another way, a variable shouldn't mean two
different things depending on where you are in the function. Right
now, at the end of the first loop, lp_valid[x] answers the question
"is line pointer x basically valid?". But by the end of the second
loop, it answers the question "is line pointer x valid and does it
also have a valid predecessor?". That kind of definitional change is
something to be avoided.

The test if (pred_in_progress || TransactionIdDidCommit(curr_xmin))
seems wrong to me. Shouldn't it be &&? Has this code been tested at
all?  It doesn't seem to have a test case. Some of these other errors
don't, either. Maybe there's some that we can't easily test in an
automated way, but we should test what we can. I guess maybe casual
testing wouldn't reveal the problem here because of the recheck, but
it's worrying to find logic that doesn't look right with no
corresponding comments or test cases.

Some error message kibitzing:

 psprintf("redirected tuple at line pointer offset %u is not heap only tuple",

It seems to me that this should say "redirected line pointer pointing
to a non-heap-only tuple at offset %u". There is no such thing as a
redirected tuple -- and even if there were, what we have here is
clearly a redirected line pointer.

psprintf("redirected tuple at line pointer offset %u is not heap only tuple",

And I think for the same reasons this one should say something like
"redirected line pointer pointing to a non-heap-only tuple at offset
%u".

 psprintf("redirected tuple at line pointer offset %u is not heap
updated tuple",

Possibly all of these would sound better with "points" rather than
"pointing" -- if so, we'd need to change an existing message in the
same way.

And this one should say something like "redirected line pointer
pointing to a tuple not produced by an update at offset %u".

 psprintf("tuple is root of chain but it is marked as heap-only tuple"));

I think this would sound better if you deleted the word "it".

I don't know whether it's worth arguing about -- it feels like we've
argued too much already about this sort of thing -- but I am not very
convinced by initializers like OffsetNumber
predecessor[MaxOffsetNumber] = {InvalidOffsetNumber}. That style is
only correct because InvalidOffsetNumber happens to be zero. If it
were up to me, I'd use memset to clear the predecessor array. I would
not bulk initialize successor and lp_valid but make sure that the first
loop always sets them, possibly by having the top of the loop set them
to InvalidOffsetNumber and false initially and then letting code later
in the loop change the value, or possibly in some other way.
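
For illustration, a sketch of the two initialization styles under discussion (the array name comes from the patch; StaticAssertStmt is a way to document the zero-value assumption if memset is preferred):

    /* Explicit initialization, independent of InvalidOffsetNumber's value: */
    for (int i = 0; i < MaxOffsetNumber; i++)
        predecessor[i] = InvalidOffsetNumber;

    /* Or, if memset is used, document the assumption it relies on: */
    StaticAssertStmt(InvalidOffsetNumber == 0,
                     "memset initialization of predecessor[] assumes this");
    memset(predecessor, 0, sizeof(predecessor));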

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Fri, Jan 20, 2023 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think that the handling of lp_valid[] in the loop that begins with
"Loop over offset and populate predecessor array from all entries that
are present in successor array" is very confusing. I think that
lp_valid[] should be answering the question "is the line pointer
basically sane?". That is, if it's a redirect, it needs to point to
something within the line pointer array (and we also check that it
must be an entry in the line pointer array that is used, which seems
fine). If it's not a redirect, it needs to point to space that's
entirely within the block, properly aligned, and big enough to contain
a tuple. We determine the answers to all of these questions in the
first loop, the one that starts with /* Perform tuple checks */.

Nothing that happens in the second loop, where we populate the
predecessor array, can reverse our previous conclusion that the line
pointer is valid, so this loop shouldn't be resetting entries in
lp_valid[] to false. The reason that it's doing so seems to be that it
wants to use lp_valid[] to control the behavior of the third loop,
where we perform checks against things that have entries in the
predecessor array. As written, the code ensures that we always set
lp_valid[nextoffnum] to false unless we set predecessor[nextoffnum] to
a value other than InvalidOffsetNumber. But that is needlessly
complex: the third loop doesn't need to look at lp_valid[] at all. It
can just check whether predecessor[currentoffnum] is valid. If it is,
perform checks. Otherwise, skip it. It seems to me that this would be
significantly simpler. 
I was trying to use lp_valid because I need to identify the root of the HOT chain, and we validate the root of the HOT chain when we loop over the predecessor array.
                       if (nextoffnum == InvalidOffsetNumber || !lp_valid[ctx.offnum] || !lp_valid[nextoffnum])
                        {
                                /*
                                 * Set lp_valid of nextoffnum to false if the current tuple's
                                 * lp_valid is false but nextoffnum's is true. We don't add this
                                 * entry to the predecessor array, since there is no use in
                                 * validating a tuple whose predecessor is already corrupted, but
                                 * we need to identify all such tuples so that we can distinguish
                                 * among the cases of missing offsets in the predecessor array;
                                 * this helps in validating the root of the chain when we loop
                                 * over the predecessor array.
                                 */
                               if (!lp_valid[ctx.offnum] && lp_valid[nextoffnum])
                                        lp_valid[nextoffnum] = false;
I was resetting lp_valid in the last patch because we don't add data to predecessor[] in that case, and while looping over the predecessor array we need to isolate (and identify) all cases of missing data in the predecessor array in order to exactly identify the root of the HOT chain.
One solution is to always add data to the predecessor array while looping over the successor array; then, while looping over the predecessor array, we can continue with the other validations when "lp_valid[predecessor[currentoffnum]] && lp_valid[currentoffnum]" is true. But in that case, our third loop will also look at lp_valid[].

To put the above complaint another way, a variable shouldn't mean two
different things depending on where you are in the function. Right
now, at the end of the first loop, lp_valid[x] answers the question
"is line pointer x basically valid?". But by the end of the second
loop, it answers the question "is line pointer x valid and does it
also have a valid predecessor?". That kind of definitional change is
something to be avoided.

Agreed.
 
The test if (pred_in_progress || TransactionIdDidCommit(curr_xmin))
seems wrong to me. Shouldn't it be &&? Has this code been tested at
all?  It doesn't seem to have a test case. Some of these other errors
don't, either. Maybe there's some that we can't easily test in an
automated way, but we should test what we can. I guess maybe casual
testing wouldn't reveal the problem here because of the recheck, but
it's worrying to find logic that doesn't look right with no
corresponding comments or test cases.

This is entirely my mistake; apologies for that. I will fix it in my next patch. Regarding the missing test cases, I need an in-progress transaction for these test cases in 004_verify_heapam.pl, but given that test's design I don't see a clear way to obtain an in-progress transaction that I can use in the test cases. I will do more research on a way to add these missing test cases.
Some error message kibitzing:

 psprintf("redirected tuple at line pointer offset %u is not heap only tuple",

It seems to me that this should say "redirected line pointer pointing
to a non-heap-only tuple at offset %u". There is no such thing as a
redirected tuple -- and even if there were, what we have here is
clearly a redirected line pointer.

psprintf("redirected tuple at line pointer offset %u is not heap only tuple",

And I think for the same reasons this one should say something like
"redirected line pointer pointing to a non-heap-only tuple at offset
%u".

 psprintf("redirected tuple at line pointer offset %u is not heap
updated tuple",

Possibly all of these would sound better with "points" rather than
"pointing" -- if so, we'd need to change an existing message in the
same way.

And this one should say something like "redirected line pointer
pointing to a tuple not produced by an update at offset %u".

 psprintf("tuple is root of chain but it is marked as heap-only tuple"));

I think this would sound better if you deleted the word "it".

Will change accordingly in my next patch. 
I don't know whether it's worth arguing about -- it feels like we've
argued too much already about this sort of thing -- but I am not very
convinced by initializers like OffsetNumber
predecessor[MaxOffsetNumber] = {InvalidOffsetNumber}. That style is
only correct because InvalidOffsetNumber happens to be zero. If it
were up to me, I'd use memset to clear the predecessor array. I would
not bulk initialize successor and lp_valid but make sure that the first
loop always sets them, possibly by having the top of the loop set them
to InvalidOffsetNumber and false initially and then letting code later
in the loop change the value, or possibly in some other way.

Agreed, will fix in my next patch.

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Sun, Jan 22, 2023 at 10:19 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> I was trying to use lp_valid because I need to identify the root of the HOT
> chain, and we validate the root of the HOT chain when we loop over the
> predecessor array.
> I was resetting lp_valid in the last patch because we don't add data to
> predecessor[] in that case, and while looping over the predecessor array we
> need to isolate (and identify) all cases of missing data in the predecessor
> array in order to exactly identify the root of the HOT chain.
> One solution is to always add data to the predecessor array while looping
> over the successor array; then, while looping over the predecessor array, we
> can continue with the other validations when "lp_valid[predecessor[currentoffnum]] &&
> lp_valid[currentoffnum]" is true. But in that case, our third loop will also
> look at lp_valid[].

I don't mind if the third loop looks at lp_valid if it has a reason to
do that, but I don't think we should be resetting values from true to
false. Once we know a line pointer to be valid, it doesn't stop being
valid later because we found out some other thing about something
else.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:
Hi Hackers,

On Sun, Jan 22, 2023 at 8:48 PM Himanshu Upadhyaya <upadhyaya.himanshu@gmail.com> wrote:

The test if (pred_in_progress || TransactionIdDidCommit(curr_xmin))
seems wrong to me. Shouldn't it be &&? Has this code been tested at
all?  It doesn't seem to have a test case. Some of these other errors
don't, either. Maybe there's some that we can't easily test in an
automated way, but we should test what we can. I guess maybe casual
testing wouldn't reveal the problem here because of the recheck, but
it's worrying to find logic that doesn't look right with no
corresponding comments or test cases.

This is entirely my mistake; apologies for that. I will fix it in my next patch. Regarding the missing test cases, I need an in-progress transaction for these test cases in 004_verify_heapam.pl, but given that test's design I don't see a clear way to obtain an in-progress transaction that I can use in the test cases. I will do more research on a way to add these missing test cases.

I am trying to add test cases related to in-progress transactions in
004_verify_heapam.pl, but I am unable to find a proper way to achieve this.
We have logic where we manually corrupt each tuple; please refer to the code
just after the comment below in 004_verify_heapam.pl:

"# Corrupt the tuples, one type of corruption per tuple.  Some types of
# corruption cause verify_heapam to skip to the next tuple without
# performing any remaining checks, so we can't exercise the system properly if
# we focus all our corruption on a single tuple."

Before this, we stop the node with "$node->stop;" and only then proceed to
the manual corruption. This aborts all running/in-progress transactions.
So, if we create an in-progress transaction and comment out "$node->stop;",
the code we have for manual corruption no longer works.

I think it is required to stop the server before proceeding with the manual
corruption? If so, please suggest whether there is a way to get an
in-progress transaction that we can use for manual corruption.
--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Mon, Jan 30, 2023 at 8:24 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Before this, we stop the node with "$node->stop;" and only then proceed to
> the manual corruption. This aborts all running/in-progress transactions.
> So, if we create an in-progress transaction and comment out "$node->stop;",
> the code we have for manual corruption no longer works.
>
> I think it is required to stop the server before proceeding with the manual corruption?
> If so, please suggest whether there is a way to get an in-progress transaction
> that we can use for manual corruption.

How about using a prepared transaction?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Tue, Jan 31, 2023 at 7:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jan 30, 2023 at 8:24 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Before this, we stop the node with "$node->stop;" and only then proceed to
> the manual corruption. This aborts all running/in-progress transactions.
> So, if we create an in-progress transaction and comment out "$node->stop;",
> the code we have for manual corruption no longer works.
>
> I think it is required to stop the server before proceeding with the manual corruption?
> If so, please suggest whether there is a way to get an in-progress transaction
> that we can use for manual corruption.

How about using a prepared transaction?

Thanks, yes it's working fine with Prepared Transaction.
Please find attached the v9 patch incorporating all the review comments.


--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Sun, Feb 5, 2023 at 3:57 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Thanks, yes it's working fine with Prepared Transaction.
> Please find attached the v9 patch incorporating all the review comments.

I don't know quite how we're still going around in circles about this,
but this code makes no sense to me at all:

            /*
             * Add data to the predecessor array even if the current or
             * successor's LP is not valid. We will not process/validate these
             * offset entries while looping over the predecessor array but
             * having all entries in the predecessor array will help in
             * identifying(and validating) the Root of a chain.
             */
            if (!lp_valid[ctx.offnum] || !lp_valid[nextoffnum])
            {
                predecessor[nextoffnum] = ctx.offnum;
                continue;
            }

If the current offset number is not for a valid line pointer, then it
makes no sense to talk about the successor. An invalid redirected line
pointer is one that points off the end of the line pointer array, or
to before the beginning of the line pointer array, or to a line
pointer that is unused. An invalid line pointer that is LP_USED is one
which points to a location outside the page, or to a location inside the
page that is improperly aligned or too small to hold a tuple. In none of
these cases does it make any sense to talk about
the next tuple. If the line pointer isn't valid, it's pointing to some
invalid location where there cannot possibly be a tuple. In other
words, if lp_valid[ctx.offnum] is false, then nextoffnum is a garbage
value, and therefore referencing predecessor[nextoffnum] is useless
and dangerous.

If the next offset number is not for a valid line pointer, we could in
theory still assign to the predecessor array, as you propose here. In
that case, the tuple or line pointer at ctx.offnum is pointing to the
line pointer at nextoffnum and that is all fine. But what is the
point? The comment claims that the point is that it will help us
identify and validate the root of the hot chain. But if the line
pointer at nextoffnum is not valid, it can't be the root of a hot
chain. When we're talking about the root of a HOT chain, we're
speaking about a tuple. If lp_valid[nextoffnum] is false, there is no
tuple. Instead of pointing to a tuple, that line pointer is pointing
to garbage.
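
For illustration, a minimal sketch of the "basically sane" line pointer test described above (a sketch only, built from the standard itemid.h/bufpage.h macros; not the patch's actual code):

    static bool
    lp_basically_valid(Page page, ItemId lp, OffsetNumber maxoff)
    {
        if (ItemIdIsRedirected(lp))
        {
            /* A redirect must stay inside the line pointer array and be in use. */
            OffsetNumber target = ItemIdGetRedirect(lp);

            return target >= FirstOffsetNumber && target <= maxoff &&
                ItemIdIsUsed(PageGetItemId(page, target));
        }

        if (!ItemIdIsUsed(lp) || !ItemIdHasStorage(lp))
            return false;

        /* Tuple storage must lie within the page, be aligned, and fit a header. */
        return ItemIdGetOffset(lp) + ItemIdGetLength(lp) <= BLCKSZ &&
            ItemIdGetOffset(lp) == MAXALIGN(ItemIdGetOffset(lp)) &&
            ItemIdGetLength(lp) >= MAXALIGN(SizeofHeapTupleHeader);
    }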

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Wed, Feb 8, 2023 at 11:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Feb 5, 2023 at 3:57 AM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Thanks, yes it's working fine with Prepared Transaction.
> Please find attached the v9 patch incorporating all the review comments.

I don't know quite how we're still going around in circles about this,
but this code makes no sense to me at all:

            /*
             * Add data to the predecessor array even if the current or
             * successor's LP is not valid. We will not process/validate these
             * offset entries while looping over the predecessor array but
             * having all entries in the predecessor array will help in
             * identifying(and validating) the Root of a chain.
             */
            if (!lp_valid[ctx.offnum] || !lp_valid[nextoffnum])
            {
                predecessor[nextoffnum] = ctx.offnum;
                continue;
            }

If the current offset number is not for a valid line pointer, then it
makes no sense to talk about the successor. An invalid redirected line
pointer is one that points off the end of the line pointer array, or
to before the beginning of the line pointer array, or to a line
pointer that is unused. An invalid line pointer that is LP_USED is one
which points to a location outside the page, or to a location inside the
page that is improperly aligned or too small to hold a tuple. In none of
these cases does it make any sense to talk about
the next tuple. If the line pointer isn't valid, it's pointing to some
invalid location where there cannot possibly be a tuple. In other
words, if lp_valid[ctx.offnum] is false, then nextoffnum is a garbage
value, and therefore referencing predecessor[nextoffnum] is useless
and dangerous.

If the next offset number is not for a valid line pointer, we could in
theory still assign to the predecessor array, as you propose here. In
that case, the tuple or line pointer at ctx.offnum is pointing to the
line pointer at nextoffnum and that is all fine. But what is the
point? The comment claims that the point is that it will help us
identify and validate the root of the hot chain. But if the line
pointer at nextoffnum is not valid, it can't be the root of a hot
chain. When we're talking about the root of a HOT chain, we're
speaking about a tuple. If lp_valid[nextoffnum] is false, there is no
tuple. Instead of pointing to a tuple, that line pointer is pointing
to garbage.


Initially, while implementing the logic to identify the root of the HOT chain,
I was getting crashes and regression failures, and at that time I thought of
adding this check along with a few other changes that were required. But you
are right: adding data to the predecessor array in this case is unnecessary.
I am removing this from the patch.

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Thu, Feb 9, 2023 at 12:09 PM Himanshu Upadhyaya
<upadhyaya.himanshu@gmail.com> wrote:
> Initially, while implementing the logic to identify the root of the HOT chain,
> I was getting crashes and regression failures, and at that time I thought of
> adding this check along with a few other changes that were required. But you
> are right: adding data to the predecessor array in this case is unnecessary.
> I am removing this from the patch.

I finally found time to look at this today -- apologies for the long
delay -- and I don't think that it addresses my objections. When I
proposed lp_valid, I had a very simple idea in mind: it tells you
whether or not the line pointer is, at some basic level, valid. Like,
it contains numbers that could point to a tuple on the page, at least
hypothetically. But that is something that can be determined strictly
by inspecting the line pointer, and yet you have
check_tuple_visibility() changing the value based on the visibility
status of xmin. So it seems that we still don't have a patch where the
value of a variable called lp_valid corresponds to whether or not the
L.P. is valid.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Mon, Mar 6, 2023 at 12:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So it seems that we still don't have a patch where the
> value of a variable called lp_valid corresponds to whether or not the
> L.P. is valid.

Here's a worked-over version of this patch. Changes:

- I got rid of the code that sets lp_valid in funny places and instead
arranged to have check_tuple_visibility() pass up the information on
the XID status. That's important, because we can't casually apply
operations like TransactionIdIsCommitted() to XIDs that, for all we
know, might not even be in the range covered by CLOG. In such cases,
we should not perform any HOT chain validation because we can't do it
sensibly; the new code accomplishes this, and also reduces the number
of CLOG lookups as compared with your version.

- I moved most of the HOT chain checks from the loop over the
predecessor[] array to the loop over the successor[] array. It didn't
seem to have any value to put them in the third loop; it forces us to
expend extra code to distinguish between redirects and tuples,
information that we already had in the second loop. The only check
that seems to make sense to do in that last loop is the one for a HOT
chain that starts with a HOT tuple, which can't be done any earlier.

- I realized that your patch had a guard against setting the
predecessor[] when it was set already only for tuples, not for
redirects. That means if a redirect pointed into the middle of a HOT
chain we might not report corruption appropriately. I fixed this and
reworded the associated messages a bit.

- Assorted cosmetic and comment changes.

I think this is easier to follow and more nearly correct, but what do
you (and others) think?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: HOT chain validation in verify_heapam()

From
Mark Dilger
Date:

> On Mar 7, 2023, at 10:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Mar 6, 2023 at 12:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> So it seems that we still don't have a patch where the
>> value of a variable called lp_valid corresponds to whether or not the
>> L.P. is valid.
>
> Here's a worked-over version of this patch. Changes:
>
> - I got rid of the code that sets lp_valid in funny places and instead
> arranged to have check_tuple_visibility() pass up the information on
> the XID status. That's important, because we can't casually apply
> operations like TransactionIdIsCommitted() to XIDs that, for all we
> know, might not even be in the range covered by CLOG. In such cases,
> we should not perform any HOT chain validation because we can't do it
> sensibly; the new code accomplishes this, and also reduces the number
> of CLOG lookups as compared with your version.
>
> - I moved most of the HOT chain checks from the loop over the
> predecessor[] array to the loop over the successor[] array. It didn't
> seem to have any value to put them in the third loop; it forces us to
> expend extra code to distinguish between redirects and tuples,
> information that we already had in the second loop. The only check
> that seems to make sense to do in that last loop is the one for a HOT
> chain that starts with a HOT tuple, which can't be done any earlier.
>
> - I realized that your patch had a guard against setting the
> predecessor[] when it was set already only for tuples, not for
> redirects. That means if a redirect pointed into the middle of a HOT
> chain we might not report corruption appropriately. I fixed this and
> reworded the associated messages a bit.
>
> - Assorted cosmetic and comment changes.
>
> I think this is easier to follow and more nearly correct, but what do
> you (and others) think?

Thanks, Robert.  Quickly skimming over this patch, it looks like something reviewable.  Your changes to
t/004_verify_heapam.pl appear to be consistent with how that test was intended to function.

Note that I have not tried any of this yet.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi,

> Note that I have not tried any of this yet.

I did, both with Meson and Autotools. All in all the patch looks very
good, but I have a few little nitpicks.

```
+                /* HOT chains should not intersect. */
+                if (predecessor[nextoffnum] != InvalidOffsetNumber)
+                {
+                    report_corruption(&ctx,
+                                      psprintf("redirect line pointer points to offset %u, but offset %u also points there",
+                                               (unsigned) nextoffnum, (unsigned) predecessor[nextoffnum]));
+                    continue;
+                }
```

This type of corruption doesn't seem to be test-covered.

```
+            /*
+             * If the next line pointer is a redirect, or if it's a tuple
+             * but the XMAX of this tuple doesn't match the XMIN of the next
+             * tuple, then the two aren't part of the same update chain and
+             * there is nothing more to do.
+             */
+            if (ItemIdIsRedirected(next_lp))
+                continue;
```

lcov shows that the `continue` path is never executed. This is
probably not a big deal however.

```
+$node->append_conf('postgresql.conf','max_prepared_transactions=100');
```

From what I can tell this line is not needed.

-- 
Best regards,
Aleksander Alekseev



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Wed, Mar 8, 2023 at 5:35 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> I did, both with Meson and Autotools. All in all the patch looks very
> good, but I have a few little nitpicks.

Thank you for the nitpicks.

> +                /* HOT chains should not intersect. */
> +                if (predecessor[nextoffnum] != InvalidOffsetNumber)
> +                {
> +                    report_corruption(&ctx,
> +                                      psprintf("redirect line pointer points to offset %u, but offset %u also points there",
> +                                               (unsigned) nextoffnum, (unsigned) predecessor[nextoffnum]));
> +                    continue;
> +                }
> ```
>
> This type of corruption doesn't seem to be test-covered.

Himanshu, would you be able to try to write a test case for this? I
think you need something like this: update a tuple with a lower TID to
produce a tuple with a higher TID, e.g. (0,10) is updated to produce
(0,11). But then have a redirect line pointer that also points to the
result of the update, in this case (0,11).

> ```
> +            /*
> +             * If the next line pointer is a redirect, or if it's a tuple
> +             * but the XMAX of this tuple doesn't match the XMIN of the next
> +             * tuple, then the two aren't part of the same update chain and
> +             * there is nothing more to do.
> +             */
> +            if (ItemIdIsRedirected(next_lp))
> +                continue;
> ```
>
> lcov shows that the `continue` path is never executed. This is
> probably not a big deal however.

It might be good to have a negative test case for this, though. Let's
say we, e.g. update (0,1) to produce (0,2), but then abort. The page
is HOT-pruned. Then we insert a new tuple at (0,2), HOT-update it
to produce (0,3), and commit. Then we HOT-prune again. Possibly we
could try to write a test case that verifies that this does NOT
produce any corruption indication.

> ```
> +$node->append_conf('postgresql.conf','max_prepared_transactions=100');
> ```
>
> From what I can tell this line is not needed.

That surprises me, because the new test cases involve preparing a
transaction, and by default max_prepared_transactions=0. So it seems
to me (without testing) that this ought to be required. Did you test
that it works without this setting?

The value of 100 seems a bit excessive, though. Most TAP tests seem to use 10.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi,

> > ```
> > +$node->append_conf('postgresql.conf','max_prepared_transactions=100');
> > ```
> >
> > From what I can tell this line is not needed.
>
> That surprises me, because the new test cases involve preparing a
> transaction, and by default max_prepared_transactions=0. So it seems
> to me (without testing) that this ought to be required. Did you test
> that it works without this setting?

Sorry, I was wrong the first time. The test fails without this line:

```
112/238 postgresql:pg_amcheck / pg_amcheck/004_verify_heapam ERROR
4.94s exit status 29
```

-- 
Best regards,
Aleksander Alekseev



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:

On Wed, Mar 8, 2023 at 7:06 PM Robert Haas <robertmhaas@gmail.com> wrote:

> +                /* HOT chains should not intersect. */
> +                if (predecessor[nextoffnum] != InvalidOffsetNumber)
> +                {
> +                    report_corruption(&ctx,
> +                                      psprintf("redirect line pointer points to offset %u, but offset %u also points there",
> +                                               (unsigned) nextoffnum, (unsigned) predecessor[nextoffnum]));
> +                    continue;
> +                }
> ```
>
> This type of corruption doesn't seem to be test-covered.

Himanshu, would you be able to try to write a test case for this? I
think you need something like this: update a tuple with a lower TID to
produce a tuple with a higher TID, e.g. (0,10) is updated to produce
(0,11). But then have a redirect line pointer that also points to the
result of the update, in this case (0,11).

Sure Robert, I will work on this. 
> ```
> +            /*
> +             * If the next line pointer is a redirect, or if it's a tuple
> +             * but the XMAX of this tuple doesn't match the XMIN of the next
> +             * tuple, then the two aren't part of the same update chain and
> +             * there is nothing more to do.
> +             */
> +            if (ItemIdIsRedirected(next_lp))
> +                continue;
> ```
>
> lcov shows that the `continue` path is never executed. This is
> probably not a big deal however.

It might be good to have a negative test case for this, though. Let's
say we, e.g. update (0,1) to produce (0,2), but then abort. The page
is HOT-pruned. Then we insert a new tuple at (0,2), HOT-update it
to produce (0,3), and commit. Then we HOT-prune again. Possibly we
could try to write a test case that verifies that this does NOT
produce any corruption indication.

Will work on this too.
> ```
> +$node->append_conf('postgresql.conf','max_prepared_transactions=100');
> ```
>
> From what I can tell this line is not needed.

That surprises me, because the new test cases involve preparing a
transaction, and by default max_prepared_transactions=0. So it seems
to me (without testing) that this ought to be required. Did you test
that it works without this setting?

The value of 100 seems a bit excessive, though. Most TAP tests seem to use 10.

We need this for the prepared transaction; will change it to 10.

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Wed, Mar 8, 2023 at 7:30 PM Himanshu Upadhyaya <upadhyaya.himanshu@gmail.com> wrote:
Please find the v11 patch with all these changes.

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: HOT chain validation in verify_heapam()

From
Aleksander Alekseev
Date:
Hi,

> On Wed, Mar 8, 2023 at 7:30 PM Himanshu Upadhyaya <upadhyaya.himanshu@gmail.com> wrote:
> Please find the v11 patch with all these changes.

The patch needed a rebase due to a4f23f9b. PFA v12.

--
Best regards,
Aleksander Alekseev

Attachment

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Fri, Mar 17, 2023 at 8:31 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> The patch needed a rebase due to a4f23f9b. PFA v12.

I have committed this after tidying up a bunch of things in the test
case file that I found too difficult to understand -- or in some cases
just incorrect, like:

     elsif ($offnum == 35)
     {
-        # set xmax to invalid transaction id.
         $tup->{t_xmin} = $in_progress_xid;
         $tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
         push @expected,

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I have committed this after tidying up a bunch of things in the test
> case file that I found too difficult to understand -- or in some cases
> just incorrect, like:

My animal mamba doesn't like this one bit.

I suspect the reason is that it's big-endian (PPC) and the endianness
hacking in the test is simply wrong:

        syswrite($file,
                 pack("L", $ENDIANNESS eq 'little' ? 0x00010019 : 0x19000100))
            or BAIL_OUT("syswrite failed: $!");

pack's L code should already be performing an endianness swap, so why
are we doing another one in the argument?

            regards, tom lane



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2023-03-22 09:19:18 -0400, Robert Haas wrote:
> On Fri, Mar 17, 2023 at 8:31 AM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> > The patch needed a rebase due to a4f23f9b. PFA v12.
>
> I have committed this after tidying up a bunch of things in the test
> case file that I found too difficult to understand -- or in some cases
> just incorrect, like:

As noticed by Andrew
https://postgr.es/m/bfa5bd2b-c0e6-9d65-62ce-97f4766b1c42%40dunslane.net and
then reproduced on HEAD by me, there's something not right here.

At the very least there's missing verification that tids actually exist in the
"Update chain validation" loop, leading to:
TRAP: failed Assert("ItemIdHasStorage(itemId)"), File: "../../../../home/andres/src/postgresql/src/include/storage/bufpage.h", Line: 355, PID: 645093
 

Which makes sense - the earlier loop adds t_ctid to the successor array, which
we then query without checking if there still is such a tid on the page.

I suspect we don't just need a !ItemIdIsUsed() check, but also a check against the
max offset on the page.
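
For illustration, the kind of guard this implies before dereferencing a successor entry (a sketch; the variable names are assumed from the committed code):

    nextoffnum = successor[ctx.offnum];
    if (nextoffnum == InvalidOffsetNumber ||
        nextoffnum > PageGetMaxOffsetNumber(ctx.page))
        continue;               /* t_ctid points beyond the line pointer array */

    next_lp = PageGetItemId(ctx.page, nextoffnum);
    if (!ItemIdIsUsed(next_lp) || !ItemIdHasStorage(next_lp))
        continue;               /* no tuple storage there to dereference */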

WRT these failures:
non-heap-only update produced a heap-only tuple at offset 20

I think that's largely a consequence of HeapTupleHeaderIsHotUpdated()'s
definition:
/*
 * Note that we stop considering a tuple HOT-updated as soon as it is known
 * aborted or the would-be updating transaction is known aborted.  For best
 * efficiency, check tuple visibility before using this macro, so that the
 * INVALID bits will be as up to date as possible.
 */
#define HeapTupleHeaderIsHotUpdated(tup) \
( \
    ((tup)->t_infomask2 & HEAP_HOT_UPDATED) != 0 && \
    ((tup)->t_infomask & HEAP_XMAX_INVALID) == 0 && \
    !HeapTupleHeaderXminInvalid(tup) \
)


Currently the new verify_heapam() follows ctid chains when XMAX_INVALID is set
and expects to find an item it can dereference - but I don't think that's
something we can rely on: Afaics HOT pruning can break chains, but doesn't
reset xmax.
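
For illustration, the guard this implies might look like the following (a sketch; HEAP_XMAX_INVALID is the real infomask flag, while curr_htup is assumed from the surrounding loop):

    /*
     * Don't treat t_ctid as a live successor once xmax is known invalid:
     * HOT pruning may have broken the chain without resetting t_ctid.
     */
    if (curr_htup->t_infomask & HEAP_XMAX_INVALID)
        continue;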

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Wed, Mar 22, 2023 at 1:45 PM Andres Freund <andres@anarazel.de> wrote:
> At the very least there's missing verification that tids actually exists in the
> "Update chain validation" loop, leading to:
> TRAP: failed Assert("ItemIdHasStorage(itemId)"), File: "../../../../home/andres/src/postgresql/src/include/storage/bufpage.h", Line: 355, PID: 645093
>
> Which makes sense - the earlier loop adds t_ctid to the successor array, which
> we then query without checking if there still is such a tid on the page.
>
> I suspect we don't just need a !ItemIdIsUsed() check, but also a check against the
> max offset on the page.

We definitely need to do it that way, since a heap-only tuple's t_ctid
is allowed to point to almost anything. I guess it can't point to some
completely different heap block, but that's about the only
restriction. In particular, it can point to an item that's past the
end of the page following line pointer array truncation (truncation
can happen during pruning or when the second heap pass takes place in
VACUUM).

OTOH the rules for LP_REDIRECT items *are* very strict. They need to
be, since it's the root item of the HOT chain, referenced by TIDs in
indexes, and have no heap tuple header metadata to use in cross-checks
that take place during HOT chain traversal (traversal by code in
places such as heap_hot_search_buffer).

> WRT these failures:
> non-heap-only update produced a heap-only tuple at offset 20
>
> I think that's largely a consequence of HeapTupleHeaderIsHotUpdated()'s
> definition:

That has to be a problem for verify_heapam.

> Currently the new verify_heapam() follows ctid chains when XMAX_INVALID is set
> and expects to find an item it can dereference - but I don't think that's
> something we can rely on: Afaics HOT pruning can break chains, but doesn't
> reset xmax.

I think that we need two passes to be completely thorough. An initial
pass, that works pretty much as-is, plus a second pass that locates
any orphaned heap-only tuples -- heap-only tuples that were not deemed
part of a valid HOT chain during the first pass. These remaining
orphaned heap-only tuples should be verified as having come from
now-aborted transactions (they should definitely be fully DEAD) --
otherwise we have corruption.

That's what my abandoned patch to make heap pruning more robust did,
you'll recall.
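
For illustration, a rough sketch of the proposed second pass (the bookkeeping arrays is_heap_only[], reached_by_valid_chain[], and tuphdr[] are hypothetical; they stand in for state gathered during the first pass):

    /*
     * Second pass: any heap-only tuple not reached by a validated HOT chain
     * must come from a now-aborted transaction, i.e. be fully DEAD.
     */
    for (OffsetNumber off = FirstOffsetNumber; off <= maxoff;
         off = OffsetNumberNext(off))
    {
        if (!is_heap_only[off] || reached_by_valid_chain[off])
            continue;

        if (!TransactionIdDidAbort(HeapTupleHeaderGetXmin(tuphdr[off])))
            report_corruption(&ctx,
                              psprintf("orphaned heap-only tuple at offset %u was not produced by an aborted transaction",
                                       (unsigned) off));
    }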

--
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Peter Geoghegan
Date:
On Wed, Mar 22, 2023 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Currently the new verify_heapam() follows ctid chains when XMAX_INVALID is set
> > and expects to find an item it can dereference - but I don't think that's
> > something we can rely on: Afaics HOT pruning can break chains, but doesn't
> > reset xmax.
>
> I think that we need two passes to be completely thorough. An initial
> pass, that works pretty much as-is, plus a second pass that locates
> any orphaned heap-only tuples -- heap-only tuples that were not deemed
> part of a valid HOT chain during the first pass. These remaining
> orphaned heap-only tuples should be verified as having come from
> now-aborted transactions (they should definitely be fully DEAD) --
> otherwise we have corruption.

I see that there is a second pass over the heap page in
verify_heapam(), in fact. Kind of. I'm referring to this loop:

        /*
         * An update chain can start either with a non-heap-only tuple or with
         * a redirect line pointer, but not with a heap-only tuple.
         *
         * (This check is in a separate loop because we need the predecessor
         * array to be fully populated before we can perform it.)
         */
        for (ctx.offnum = FirstOffsetNumber;
             ctx.offnum <= maxoff;
             ctx.offnum = OffsetNumberNext(ctx.offnum))
        {
            if (xmin_commit_status_ok[ctx.offnum] &&
                (xmin_commit_status[ctx.offnum] == XID_COMMITTED ||
                 xmin_commit_status[ctx.offnum] == XID_IN_PROGRESS) &&
                predecessor[ctx.offnum] == InvalidOffsetNumber)
            {
                ItemId      curr_lp;

                curr_lp = PageGetItemId(ctx.page, ctx.offnum);
                if (!ItemIdIsRedirected(curr_lp))
                {
                    HeapTupleHeader curr_htup;

                    curr_htup = (HeapTupleHeader)
                        PageGetItem(ctx.page, curr_lp);
                    if (HeapTupleHeaderIsHeapOnly(curr_htup))
                        report_corruption(&ctx,
                                          psprintf("tuple is root of
chain but is marked as heap-only tuple"));
                }
            }
        }


However, this "second pass over page" loop has roughly the same
problem as the nearby HeapTupleHeaderIsHotUpdated() coding pattern: it
doesn't account for the fact that a tuple whose xmin was
XID_IN_PROGRESS a little earlier on may not be in that state once we
reach the second pass loop. Concurrent transaction abort needs to be
accounted for. The loop needs to recheck xmin status, at least in the
initially-XID_IN_PROGRESS-xmin case.
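
For illustration, a sketch of the recheck being suggested (xmin_commit_status[] and xmin_commit_status_ok[] are the arrays from the committed code quoted above; curr_htup is assumed):

    if (xmin_commit_status_ok[ctx.offnum] &&
        xmin_commit_status[ctx.offnum] == XID_IN_PROGRESS)
    {
        TransactionId xmin = HeapTupleHeaderGetXmin(curr_htup);

        /*
         * The inserting transaction may have aborted since the first pass
         * recorded XID_IN_PROGRESS; recheck before reporting corruption.
         */
        if (!TransactionIdIsInProgress(xmin))
            continue;
    }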

--
Peter Geoghegan



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2023-03-22 13:45:52 -0700, Andres Freund wrote:
> On 2023-03-22 09:19:18 -0400, Robert Haas wrote:
> > On Fri, Mar 17, 2023 at 8:31 AM Aleksander Alekseev
> > <aleksander@timescale.com> wrote:
> > > The patch needed a rebase due to a4f23f9b. PFA v12.
> >
> > I have committed this after tidying up a bunch of things in the test
> > case file that I found too difficult to understand -- or in some cases
> > just incorrect, like:
>
> As noticed by Andrew
> https://postgr.es/m/bfa5bd2b-c0e6-9d65-62ce-97f4766b1c42%40dunslane.net and
> then reproduced on HEAD by me, there's something not right here.
>
> At the very least there's missing verification that tids actually exists in the
> "Update chain validation" loop, leading to:
> TRAP: failed Assert("ItemIdHasStorage(itemId)"), File: "../../../../home/andres/src/postgresql/src/include/storage/bufpage.h", Line: 355, PID: 645093
 
>
> Which makes sense - the earlier loop adds t_ctid to the successor array, which
> we then query without checking if there still is such a tid on the page.

It's not quite so simple - I see now that the lp_valid check tries to prevent
that. However, it's not sufficient, because there is no guarantee that
lp_valid[nextoffnum] is initialized. Consider what happens if t_ctid of a heap
tuple points to beyond the end of the item array - lp_valid[nextoffnum] won't
be initialized.

Why are redirections now checked in two places? There already was an
ItemIdIsUsed() check in the "/* Perform tuple checks */" loop, but now there's
the ItemIdIsRedirected() check in the "Update chain validation." loop as well
- and the output of that is confusing, because it'll just mention the target
of the redirect.


I also think it's not quite right that some of the checks inside if
(ItemIdIsRedirected()) continue in case of corruption and others don't. While
there's a later continue, that means the corrupt tuples get added to the
predecessor array.  Similarly, in the non-redirect portion, the successor
array gets filled with corrupt tuples, which doesn't seem quite right to me.


A patch addressing some, but not all, of those is attached. With that I don't
see any crashes or false-positives anymore.

Greetings,

Andres Freund

Attachment

Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2023-03-22 14:56:22 -0700, Andres Freund wrote:
> A patch addressing some, but not all, of those is attached. With that I don't
> see any crashes or false-positives anymore.

That patch missed that, as committed, the first if (ItemIdIsRedirected())
check sets lp_valid[n] = true even if the target of the redirect is unused.

With that fixed, 004_verify_heapam doesn't cause a crash anymore - it doesn't
pass though, because there are a bunch of unadjusted error messages.

Andres

Attachment

Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2023-03-22 13:45:52 -0700, Andres Freund wrote:
> On 2023-03-22 09:19:18 -0400, Robert Haas wrote:
> > On Fri, Mar 17, 2023 at 8:31 AM Aleksander Alekseev
> > <aleksander@timescale.com> wrote:
> > > The patch needed a rebase due to a4f23f9b. PFA v12.
> >
> > I have committed this after tidying up a bunch of things in the test
> > case file that I found too difficult to understand -- or in some cases
> > just incorrect, like:
> 
> As noticed by Andrew
> https://postgr.es/m/bfa5bd2b-c0e6-9d65-62ce-97f4766b1c42%40dunslane.net and
> then reproduced on HEAD by me, there's something not right here.

skink / valgrind reported in a while back and found another issue:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-03-22%2021%3A53%3A41

==2490364== VALGRINDERROR-BEGIN
==2490364== Conditional jump or move depends on uninitialised value(s)
==2490364==    at 0x11D459F2: check_tuple_visibility (verify_heapam.c:1379)
==2490364==    by 0x11D46262: check_tuple (verify_heapam.c:1812)
==2490364==    by 0x11D46FDF: verify_heapam (verify_heapam.c:535)
==2490364==    by 0x3D5B2C: ExecMakeTableFunctionResult (execSRF.c:235)
==2490364==    by 0x3E8225: FunctionNext (nodeFunctionscan.c:95)
==2490364==    by 0x3D6685: ExecScanFetch (execScan.c:133)
==2490364==    by 0x3D6709: ExecScan (execScan.c:182)
==2490364==    by 0x3E813A: ExecFunctionScan (nodeFunctionscan.c:270)
==2490364==    by 0x3D31C4: ExecProcNodeFirst (execProcnode.c:464)
==2490364==    by 0x3FF7E7: ExecProcNode (executor.h:262)
==2490364==    by 0x3FFB15: ExecNestLoop (nodeNestloop.c:160)
==2490364==    by 0x3D31C4: ExecProcNodeFirst (execProcnode.c:464)
==2490364==  Uninitialised value was created by a stack allocation
==2490364==    at 0x11D45325: check_tuple_visibility (verify_heapam.c:994)
==2490364== 
==2490364== VALGRINDERROR-END
==2490364== VALGRINDERROR-BEGIN
==2490364== Conditional jump or move depends on uninitialised value(s)
==2490364==    at 0x11D45AC6: check_tuple_visibility (verify_heapam.c:1379)
==2490364==    by 0x11D46262: check_tuple (verify_heapam.c:1812)
==2490364==    by 0x11D46FDF: verify_heapam (verify_heapam.c:535)
==2490364==    by 0x3D5B2C: ExecMakeTableFunctionResult (execSRF.c:235)
==2490364==    by 0x3E8225: FunctionNext (nodeFunctionscan.c:95)
==2490364==    by 0x3D6685: ExecScanFetch (execScan.c:133)
==2490364==    by 0x3D6709: ExecScan (execScan.c:182)
==2490364==    by 0x3E813A: ExecFunctionScan (nodeFunctionscan.c:270)
==2490364==    by 0x3D31C4: ExecProcNodeFirst (execProcnode.c:464)
==2490364==    by 0x3FF7E7: ExecProcNode (executor.h:262)
==2490364==    by 0x3FFB15: ExecNestLoop (nodeNestloop.c:160)
==2490364==    by 0x3D31C4: ExecProcNodeFirst (execProcnode.c:464)
==2490364==  Uninitialised value was created by a stack allocation
==2490364==    at 0x11D45325: check_tuple_visibility (verify_heapam.c:994)
==2490364==

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Himanshu Upadhyaya
Date:


On Thu, Mar 23, 2023 at 2:15 AM Andres Freund <andres@anarazel.de> wrote:

Currently the new verify_heapam() follows ctid chains when XMAX_INVALID is set
and expects to find an item it can dereference - but I don't think that's
something we can rely on: Afaics HOT pruning can break chains, but doesn't
reset xmax.

We have the code below, which I think takes care of xmin and xmax matching; only if they match do we add the tuple to the predecessor array.
                        /*
                         * If the next line pointer is a redirect, or if it's a tuple
                         * but the XMAX of this tuple doesn't match the XMIN of the next
                         * tuple, then the two aren't part of the same update chain and
                         * there is nothing more to do.
                         */
                        if (ItemIdIsRedirected(next_lp))
                                continue;
                        curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
                        curr_xmax = HeapTupleHeaderGetUpdateXid(curr_htup);
                        next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
                        next_xmin = HeapTupleHeaderGetXmin(next_htup);
                        if (!TransactionIdIsValid(curr_xmax) ||
                                !TransactionIdEquals(curr_xmax, next_xmin))
                                continue;

--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Wed, Mar 22, 2023 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My animal mamba doesn't like this one bit.
>
> I suspect the reason is that it's big-endian (PPC) and the endianness
> hacking in the test is simply wrong:
>
>         syswrite($file,
>                  pack("L", $ENDIANNESS eq 'little' ? 0x00010019 : 0x19000100))
>             or BAIL_OUT("syswrite failed: $!");
>
> pack's L code should already be performing an endianness swap, so why
> are we doing another one in the argument?

(Apologies for having missed responding to this yesterday afternoon.)

Hmph. I didn't think very hard about that code and just assumed
Himanshu had tested it. I don't have convenient access to a big-endian
test machine myself. Are you able to check whether using 0x00010019
unconditionally works?

I think part of the reason I thought this looked OK is that there are
other places in this test case that do things differently based on
endianness, but now that you mention it, I realize that that code deals
with the varlena representation, which is endian-dependent in a way
that the ItemIdData representation is not.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Wed, Mar 22, 2023 at 4:45 PM Andres Freund <andres@anarazel.de> wrote:
> At the very least there's missing verification that tids actually exist in the
> "Update chain validation" loop, leading to:
> TRAP: failed Assert("ItemIdHasStorage(itemId)"), File: "../../../../home/andres/src/postgresql/src/include/storage/bufpage.h", Line: 355, PID: 645093
>
> Which makes sense - the earlier loop adds t_ctid to the successor array, which
> we then query without checking if there still is such a tid on the page.

Ah, crap. If the "/* Perform tuple checks */" loop finds a redirect line
pointer, it verifies that the target is between FirstOffsetNumber and
maxoff before setting lp_valid[ctx.offnum] = true. But in the case
where it's a CTID link, the equivalent checks are missing. We could
fix that like this:

--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -543,7 +543,8 @@ verify_heapam(PG_FUNCTION_ARGS)
              */
             nextblkno = ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid);
             nextoffnum = ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid);
-            if (nextblkno == ctx.blkno && nextoffnum != ctx.offnum)
+            if (nextblkno == ctx.blkno && nextoffnum != ctx.offnum &&
+                nextoffnum >= FirstOffsetNumber && nextoffnum <= maxoff)
                 successor[ctx.offnum] = nextoffnum;
         }

> I suspect we don't just need a !ItemIdIsUsed(), but also a check against the
> max offset on the page.

I don't see why we need an ItemIdIsUsed check any place where we don't
have one already. lp_valid[x] can't be true if item x isn't used,
unless we're reading beyond the initialized portion of the array,
which we shouldn't do.

> WRT these failures:
> non-heap-only update produced a heap-only tuple at offset 20
>
> I think that's largely a consequence of HeapTupleHeaderIsHotUpdated()'s
> definition:
> /*
>  * Note that we stop considering a tuple HOT-updated as soon as it is known
>  * aborted or the would-be updating transaction is known aborted.  For best
>  * efficiency, check tuple visibility before using this macro, so that the
>  * INVALID bits will be as up to date as possible.
>  */
> #define HeapTupleHeaderIsHotUpdated(tup) \
> ( \
>         ((tup)->t_infomask2 & HEAP_HOT_UPDATED) != 0 && \
>         ((tup)->t_infomask & HEAP_XMAX_INVALID) == 0 && \
>         !HeapTupleHeaderXminInvalid(tup) \
> )

Yeah, it's not good that we're looking at the hint bit or the xmin
there -- it should just be checking the infomask2 bit and nothing
else, I think.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Wed, Mar 22, 2023 at 5:56 PM Andres Freund <andres@anarazel.de> wrote:
> Why are redirections now checked in two places? There already was an
> ItemIdIsUsed() check in the "/* Perform tuple checks */" loop, but now there's
> the ItemIdIsRedirected() check in the "Update chain validation." loop as well
> - and the output of that is confusing, because it'll just mention the target
> of the redirect.

ctx.offnum is reported as part of the context message, and doesn't
need to be duplicated in the message itself. Any other offset numbers
do need to be mentioned in the message itself. I have attempted to
make sure that the code consistently follows this rule but if you see
a case where I have failed to do so, let me know.

> I also think it's not quite right that some of the checks inside if
> (ItemIdIsRedirected()) continue in case of corruption while others don't. While
> there's a later continue, that means the corrupt tuples get added to the
> predecessor array.  Similarly, in the non-redirect portion, the successor
> array gets filled with corrupt tuples, which doesn't seem quite right to me.

I'm not entirely sure I agree with this. I mean, we should skip
further checks if the data is so corrupt that further checks are not
sensible, e.g. if the line pointer is just gibberish we can't validate
anything about the tuple, because we can't even find the tuple.
However, we don't want to skip further checks as soon as we see any
kind of a problem whatsoever. For instance, even if the tuple data is
utter gibberish, that should not and does not keep us from checking
whether the update chain looks sane. If the tuple header is garbage
(e.g. xmin and xmax are outside the bounds of clog) then at least some
of the update-chain checks are not possible, because we can't know the
commit status of the tuples, but garbage in the tuple data isn't
really a problem for update chain validation per se.

So I think the places that lack a continue are making a judgement that
continuing with further checks is reasonable in those scenarios, and
it's not obvious to me that those judgements are wrong. For instance,
suppose we reach the "Can only redirect to a HOT tuple." case. If we
continue there, we won't check whether HEAP_UPDATED is set on the
successor tuple, and we won't perform the intersecting-HOT-chain
check, and we won't set the predecessor[] array, and I don't really
see why we shouldn't do those things. In my view, the fact that the
redirected line pointer points to a tuple that is not marked as HOT
does mean we have corruption, but it doesn't mean that we should give
up on thinking of those two offset numbers as part of an update chain.

> A patch addressing some, but not all, of those is attached. With that I don't
> see any crashes or false-positives anymore.

+                else if (ItemIdIsDead(rditem))
+                    report_corruption(&ctx,
+                                      psprintf("line pointer redirection to dead item at offset %u",
+                                               (unsigned) rdoffnum));
+                else if (ItemIdIsRedirected(rditem))
+                    report_corruption(&ctx,
+                                      psprintf("line pointer redirection to another redirect at offset %u",
+                                               (unsigned) rdoffnum));

Hmm, the first of these definitely seems like a good additional check.
The second one moves an existing check from the subsequent loop to
this loop, which if we're going to add that additional check makes
sense to do for consistency.

+            /* the current line pointer may not have a successor */
+            if (nextoffnum == InvalidOffsetNumber)
+                continue;
+
             /*
-             * The current line pointer may not have a successor, either
-             * because it's not valid or because it didn't point to anything.
-             * In either case, we have to give up.
-             *
-             * If the current line pointer does point to something, it's
-             * possible that the target line pointer isn't valid. We have to
-             * give up in that case, too.
+             * The successor is located beyond the end of the line item array,
+             * which can happen when the array is truncated.
              */
-            if (nextoffnum == InvalidOffsetNumber || !lp_valid[nextoffnum])
+            if (nextoffnum > maxoff)
+                continue;
+
+            /* the successor is not valid, have to give up */
+            if (!lp_valid[nextoffnum])
                 continue;

I don't agree with this part -- I think we should instead do what I
proposed before and prevent the successor[] array from ever getting
entries that are out of bounds.

-                 * Redirects are created by updates, so successor should be
-                 * the result of an update.
+                 * Redirects are created by HOT updates, so successor should
+                 * be the result of an HOT update.
+                 *
+                 * XXX: HeapTupleHeaderIsHeapOnly() should always imply
+                 * HEAP_UPDATED. This should be checked even when the tuple
+                 * isn't a target of a redirect.

Hmm, OK. So the question is where to put this check. Maybe inside
check_tuple_header(), making it independent of the update chain
validation stuff?

+             *
+             * NB: Can't use HeapTupleHeaderIsHotUpdated() as it checks if
+             * hint bits indicate xmin/xmax aborted.
              */
-            if (!HeapTupleHeaderIsHotUpdated(curr_htup) &&
+            if (!(curr_htup->t_infomask2 & HEAP_HOT_UPDATED) &&
                 HeapTupleHeaderIsHeapOnly(next_htup))
             {
                 report_corruption(&ctx,
                                   psprintf("non-heap-only update
produced a heap-only tuple at offset %u",
                                            (unsigned) nextoffnum));
             }
-            if (HeapTupleHeaderIsHotUpdated(curr_htup) &&
+            if ((curr_htup->t_infomask2 & HEAP_HOT_UPDATED) &&

Makes sense.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Wed, Mar 22, 2023 at 5:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
> However, this "second pass over page" loop has roughly the same
> problem as the nearby HeapTupleHeaderIsHotUpdated() coding pattern: it
> doesn't account for the fact that a tuple whose xmin was
> XID_IN_PROGRESS a little earlier on may not be in that state once we
> reach the second pass loop. Concurrent transaction abort needs to be
> accounted for. The loop needs to recheck xmin status, at least in the
> initially-XID_IN_PROGRESS-xmin case.

I don't understand why it would need to do that. If the transaction
has subsequently committed, it doesn't change anything: we'll get the
same report we would have gotten anyway. If the transaction has
subsequently aborted, we'll get a report about corruption that would
not have been reported if the abort had occurred slightly earlier.
However, the abort doesn't remove the corruption, just our ability to
detect it.

Consider a page where TID 1 is a redirect to TID 4; TID 2 is dead; and
TIDs 3 and 4 are heap-only tuples. Any other line pointers on the page
are unused. The only way this can validly happen is if there was a
tuple at TID 2 and it got updated to produce the tuple at TID 3 and
then that transaction aborted. Then it got updated again and produced
the tuple at TID 4 and that transaction was committed. But this
implies that the xmin of TID 3 must be aborted. If we observe that
it's in-progress, we know that the transaction that created TID 3 was
still running after TID 4 had already shown up, which should be
impossible, and so it's fair to report corruption. If the xmin of TID
3 then goes on to abort, a future attempt to verify this page won't be
able to notice the corruption any more, because it won't be able to
prove that TID 3's xmin aborted after TID 4's xmin committed. But a
current attempt to verify this page that has seen TID 3's xmin as
in-progress at any point after locking the page knows for sure that
TID 4 showed up before TID 3's inserter aborted, and that's
inconsistent with any legal order of operations.
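
To restate the example compactly (illustrative layout only):

    TID 1: redirect -> TID 4
    TID 2: dead
    TID 3: heap-only tuple, xmin = X1 (the aborted updater)
    TID 4: heap-only tuple, xmin = X2 (the committed updater)

Seeing X1 as in-progress at any point after locking the page, when TID
4 already exists, is inconsistent with any legal history - even if X1
goes on to abort afterwards.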

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Thu, Mar 23, 2023 at 9:42 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Hmph. I didn't think very hard about that code and just assumed
> Himanshu had tested it. I don't have convenient access to a big-endian
> test machine myself. Are you able to check whether using 0x00010019
> unconditionally works?

Oh, I see now you already pushed a fix. Thanks.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2023-03-23 11:41:52 -0400, Robert Haas wrote:
> On Wed, Mar 22, 2023 at 5:56 PM Andres Freund <andres@anarazel.de> wrote:
> > I also think it's not quite right that some of the checks inside if
> > (ItemIdIsRedirected()) continue in case of corruption while others don't. While
> > there's a later continue, that means the corrupt tuples get added to the
> > predecessor array.  Similarly, in the non-redirect portion, the successor
> > array gets filled with corrupt tuples, which doesn't seem quite right to me.
> 
> I'm not entirely sure I agree with this. I mean, we should skip
> further checks of the data is so corrupt that further checks are not
> sensible e.g. if the line pointer is just gibberish we can't validate
> anything about the tuple, because we can't even find the tuple.
> However, we don't want to skip further checks as soon as we see any
> kind of a problem whatsoever. For instance, even if the tuple data is
> utter gibberish, that should not and does not keep us from checking
> whether the update chain looks sane. If the tuple header is garbage
> (e.g. xmin and xmax are outside the bounds of clog) then at least some
> of the update-chain checks are not possible, because we can't know the
> commit status of the tuples, but garbage in the tuple data isn't
> really a problem for update chain validation per se.

The cases I was complaining about were ones where important metadata is
corrupt. We shouldn't enter a tuple into lp_valid[] or successor[] if it failed
validity checks - the subsequent error reports that this generates won't be
helpful - or we'll crash.

E.g. continuing after:

                rditem = PageGetItemId(ctx.page, rdoffnum);
                if (!ItemIdIsUsed(rditem))
                    report_corruption(&ctx,
                                      psprintf("line pointer redirection to unused item at offset %u",
                                               (unsigned) rdoffnum));

means we'll look into the tuple in the "update chain validation" loop even for
unused items, where, afaict, it will lead to a crash or bogus results, because:
                /* Can only redirect to a HOT tuple. */
                next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
                if (!HeapTupleHeaderIsHeapOnly(next_htup))
                {
                    report_corruption(&ctx,
                                      psprintf("redirected line pointer points to a non-heap-only tuple at offset %u",
                                               (unsigned) nextoffnum));
                }

will just dereference the unused item.
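
In other words, that report presumably needs to be followed by
something like this (hypothetical sketch; the key addition is the
continue, so the unused item never reaches lp_valid[] or successor[]):

    rditem = PageGetItemId(ctx.page, rdoffnum);
    if (!ItemIdIsUsed(rditem))
    {
        report_corruption(&ctx,
                          psprintf("line pointer redirection to unused item at offset %u",
                                   (unsigned) rdoffnum));
        continue;           /* don't treat this as part of a chain */
    }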



> -                 * Redirects are created by updates, so successor should be
> -                 * the result of an update.
> +                 * Redirects are created by HOT updates, so successor should
> +                 * be the result of an HOT update.
> +                 *
> +                 * XXX: HeapTupleHeaderIsHeapOnly() should always imply
> +                 * HEAP_UPDATED. This should be checked even when the tuple
> +                 * isn't a target of a redirect.
> 
> Hmm, OK. So the question is where to put this check. Maybe inside
> check_tuple_header(), making it independent of the update chain
> validation stuff?

Yes, check_tuple_header sounds sensible to me.

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Thu, Mar 23, 2023 at 1:26 PM Andres Freund <andres@anarazel.de> wrote:
> E.g. continuing after:
>
>                 rditem = PageGetItemId(ctx.page, rdoffnum);
>                 if (!ItemIdIsUsed(rditem))
>                     report_corruption(&ctx,
>                                       psprintf("line pointer redirection to unused item at offset %u",
>                                                (unsigned) rdoffnum));
>
> means we'll look into the tuple in the "update chain validation" loop for
> unused items.

Ah, yes, that's a goof for sure.

> > -                 * Redirects are created by updates, so successor should be
> > -                 * the result of an update.
> > +                 * Redirects are created by HOT updates, so successor should
> > +                 * be the result of an HOT update.
> > +                 *
> > +                 * XXX: HeapTupleHeaderIsHeapOnly() should always imply
> > +                 * HEAP_UPDATED. This should be checked even when the tuple
> > +                 * isn't a target of a redirect.
> >
> > Hmm, OK. So the question is where to put this check. Maybe inside
> > check_tuple_header(), making it independent of the update chain
> > validation stuff?
>
> Yes, check_tuple_header sounds sensible to me.

OK, let me spend some more time on this and I'll post a patch (or
patches) in a bit.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2023-03-23 11:20:04 +0530, Himanshu Upadhyaya wrote:
> On Thu, Mar 23, 2023 at 2:15 AM Andres Freund <andres@anarazel.de> wrote:
> 
> >
> > Currently the new verify_heapam() follows ctid chains when XMAX_INVALID is
> > set
> > and expects to find an item it can dereference - but I don't think that's
> > something we can rely on: Afaics HOT pruning can break chains, but doesn't
> > reset xmax.
> >
> We have the code below, which I think takes care of xmin and xmax matching;
> only if they match do we add the tuple to the predecessor array.
>                         /*
>                          * If the next line pointer is a redirect, or if it's a tuple
>                          * but the XMAX of this tuple doesn't match the XMIN of the next
>                          * tuple, then the two aren't part of the same update chain and
>                          * there is nothing more to do.
>                          */
>                         if (ItemIdIsRedirected(next_lp))
>                                 continue;
>                         curr_htup = (HeapTupleHeader) PageGetItem(ctx.page, curr_lp);
>                         curr_xmax = HeapTupleHeaderGetUpdateXid(curr_htup);
>                         next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);
>                         next_xmin = HeapTupleHeaderGetXmin(next_htup);
>                         if (!TransactionIdIsValid(curr_xmax) ||
>                                 !TransactionIdEquals(curr_xmax, next_xmin))
>                                 continue;

The problem is that this doesn't help if the tuple points past maxoff, because
we can't even fetch the next tuple and thus never reach these checks. But
Robert has now put in defenses against that.

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Thu, Mar 23, 2023 at 1:34 PM Robert Haas <robertmhaas@gmail.com> wrote:
> OK, let me spend some more time on this and I'll post a patch (or
> patches) in a bit.

All right, here are some more fixups.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Wed, Mar 22, 2023 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:
> skink / valgrind reported in a while back and found another issue:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-03-22%2021%3A53%3A41
>
> ==2490364== VALGRINDERROR-BEGIN
> ==2490364== Conditional jump or move depends on uninitialised value(s)
> ==2490364==    at 0x11D459F2: check_tuple_visibility (verify_heapam.c:1379)
...
> ==2490364==  Uninitialised value was created by a stack allocation
> ==2490364==    at 0x11D45325: check_tuple_visibility (verify_heapam.c:994)

OK, so this is an interesting one. It's complaining about switch
(xmax_status), because the get_xid_status(xmax, ctx, &xmax_status)
used in the previous switch might not actually initialize xmax_status,
and apparently didn't in this case. get_xid_status() does not set
xmax_status except when it returns XID_BOUNDS_OK, and the previous
switch falls through both in that case and also when get_xid_status()
returns XID_INVALID. That seems like it must be the issue here. As far
as I can see, this isn't related to any of the recent changes but has
been like this since this code was introduced, so I'm a little
confused about why it's only causing a problem now.
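
The shape of the fix is roughly this (a simplified sketch of
check_tuple_visibility(), not the patch verbatim; the "xmax is
invalid" wording is an assumption):

    switch (get_xid_status(xmax, ctx, &xmax_status))
    {
        case XID_INVALID:
            /* xmax_status was never set; report and stop here */
            report_corruption(ctx, pstrdup("xmax is invalid"));
            return false;
        case XID_BOUNDS_OK:
            break;          /* xmax_status is now meaningful */
        default:
            return false;   /* out-of-bounds cases; reports omitted in this sketch */
    }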

Nonetheless, here's a patch. I notice that there's a similar problem
in another place, too. get_xid_status() is called a total of five
times and it looks like only three of them got it right. I suppose
that if this is correct we should back-patch it.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: HOT chain validation in verify_heapam()

From
Andres Freund
Date:
Hi,

On 2023-03-23 15:37:15 -0400, Robert Haas wrote:
> On Wed, Mar 22, 2023 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:
> > skink / valgrind reported in a while back and found another issue:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-03-22%2021%3A53%3A41
> >
> > ==2490364== VALGRINDERROR-BEGIN
> > ==2490364== Conditional jump or move depends on uninitialised value(s)
> > ==2490364==    at 0x11D459F2: check_tuple_visibility (verify_heapam.c:1379)
> ...
> > ==2490364==  Uninitialised value was created by a stack allocation
> > ==2490364==    at 0x11D45325: check_tuple_visibility (verify_heapam.c:994)
> 
> OK, so this is an interesting one. It's complaining about switch
> (xmax_status), because the get_xid_status(xmax, ctx, &xmax_status)
> used in the previous switch might not actually initialize xmax_status,
> and apparently didn't in this case. get_xid_status() does not set
> xmax_status except when it returns XID_BOUNDS_OK, and the previous
> switch falls through both in that case and also when get_xid_status()
> returns XID_INVALID. That seems like it must be the issue here. As far
> as I can see, this isn't related to any of the recent changes but has
> been like this since this code was introduced, so I'm a little
> confused about why it's only causing a problem now.

Could it be that the tests didn't exercise the path before?


> Nonetheless, here's a patch. I notice that there's a similar problem
> in another place, too. get_xid_status() is called a total of five
> times and it looks like only three of them got it right. I suppose
> that if this is correct we should back-patch it.

Yea, I think you're right.


> +            report_corruption(ctx,
> +                              pstrdup("xmin is invalid"));

Not a correctness issue: Nearly all callers of report_corruption() do a
psprintf(), the rest a pstrdup(), as here. Seems like it'd be cleaner to
just make report_corruption() accept a format string?
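
Something like this, say (a hypothetical sketch of that suggestion,
using StringInfo to handle the varargs; name and placement invented
for illustration):

    /* Hypothetical format-string wrapper around report_corruption(). */
    static void
    report_corruption_fmt(HeapCheckContext *ctx, const char *fmt,...)
    {
        StringInfoData buf;

        initStringInfo(&buf);
        for (;;)
        {
            va_list     ap;
            int         needed;

            va_start(ap, fmt);
            needed = appendStringInfoVA(&buf, fmt, ap);
            va_end(ap);
            if (needed == 0)
                break;      /* the whole message fit */
            enlargeStringInfo(&buf, needed);    /* grow and retry */
        }
        report_corruption(ctx, buf.data);
    }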

Greetings,

Andres Freund



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Thu, Mar 23, 2023 at 4:36 PM Andres Freund <andres@anarazel.de> wrote:
> Could it be that the tests didn't exercise the path before?

Hmm, perhaps.

> > Nonetheless, here's a patch. I notice that there's a similar problem
> > in another place, too. get_xid_status() is called a total of five
> > times and it looks like only three of them got it right. I suppose
> > that if this is correct we should back-patch it.
>
> Yea, I think you're right.

OK.

> > +            report_corruption(ctx,
> > +                              pstrdup("xmin is invalid"));
>
> Not a correctness issue: Nearly all callers of report_corruption() do a
> psprintf(), the rest a pstrdup(), as here. Seems like it'd be cleaner to
> just make report_corruption() accept a format string?

Meh.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: HOT chain validation in verify_heapam()

From
Robert Haas
Date:
On Thu, Mar 23, 2023 at 3:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 23, 2023 at 1:34 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > OK, let me spend some more time on this and I'll post a patch (or
> > patches) in a bit.
>
> All right, here are some more fixups.

It looks like e88754a1965c0f40a723e6e46d670cacda9e19bd made skink
happy (although Peter Geoghegan has spotted a problem with it; see the
thread that begins with the commit email), so I went ahead and
committed these fixups. Hopefully that won't again make the buildfarm
unhappy, but I guess we'll see.

--
Robert Haas
EDB: http://www.enterprisedb.com